ABSTRACT

This report is aimed at helping Congress: (1) better 
understand the functions, history, capabilities, limitations, uses, 
and misuses of educational tests; (2) learn more about the promises 
and pitfalls of new assessment methods and technologies; and (3) 
identify and weigh policy options affecting educational testing. To 
prepare this report, the Office of Technology Assessment (OTA) 
examined technological and institutional aspects of educational 
testing. This summary and policy chapter synthesizes the OTA's 
findings and outlines options for congressional action. The analysis 
and discussion are framed in terms of the functions of testing. The 
OTA concludes that examining the capability of various tests to meet 
specific objectives is the necessary first step in resolving the 
controversy over testing in American schools. Proposals now before 
Congress could fundamentally alter testing in American education. 
Three such issues are: (1) proposals for national testing; (2) 
changes to the National Assessment of Educational Progress; and (3) 
revisions to the Chapter 1 provisions for educationally disadvantaged 
children. Recommendations on these issues are summarized. The 
discussion is supplemented by eight sidebars (Boxes A through H) 
highlighting specific testing issues. Three figures and six 
photographs are included. (SLD) 
Foreword



Education is a primary concern for our country, and testing is a primary tool of education. 
No other country tests its school children with the frequency and seriousness that characterize 
the United States. Once the province of classroom teachers, testing has also become an 
instrument of State and Federal policy. Over the past decade in particular, the desire of the 
Congress and State Legislatures to improve education and evaluate programs has substantially 
intensified the amount and importance of testing. 

Because of these developments and in light of current research on thinking and learning, 
Congress asked OTA to provide a comprehensive report on educational testing, with emphasis 
on new approaches. Changing technology and new understanding of thinking and learning 
offer avenues for testing in different ways. These new approaches are attractive, but inevitably 
carry some drawbacks. 

Too often, testing is treated narrowly, rather than as a flexible tool to obtain information 
about important questions. In this report, OTA places testing in its historical and policy 
context, examines the reasons for testing and the ways it is done, and identifies particular ways 
Federal policy affects the picture. The report also explores new approaches to testing that 
derive from modern technology and cognitive research. 
Summary and Policy Options



The American educational system is unique. 
Among the first in the world to establish a commit- 
ment to public elementary and secondary schooling 
for all children, it has achieved an extraordinary 
record: enrollment rates of school-age children in the 
United States are among the highest in the world, 
and over 80 percent finish high school in some form 
between the ages of 18 and 24. 1 This tradition of 
education for the masses was nurtured in a system 
that, by all outward appearances, is complex and 
fragmented: 40 million children enrolled in some 
83,000 schools scattered across some 15,000 school 
districts. Pluralism, diversity, and local control — 
hallmarks of American democracy— distinguish the 
American educational experiment from others in the 
world. 

Student testing has always played a pivotal role in 
this experiment. Every day millions of school 
children take tests. Most are devised by teachers to 
see how well their pupils are learning and to signal 
to pupils what they should be studying. Surprise 
quizzes, take-home written assignments, oral presentations, 
pretests, retests, and end-of-year comprehensive 
examinations are all in the teacher's toolbox. 

It is another category of test, however— originating 
outside the classroom, usually with standardized 
rules for scoring and administration — that has gar- 
nered the most attention, discussion, and contro- 
versy. From the earliest days of the public school 
movement, American educators, parents, policy- 
makers, and taxpayers have turned to these tests as 
multipurpose tools: yardstick of individual progress 
in classrooms, agent of school reform, filter of 
educational opportunity, and barometer of the national 
educational condition. 

Commonly referred to as "standardized tests," 2 
these instruments usually serve management func- 
tions; they are intended to inform decisions made by 
people other than the classroom teacher. They are 
used to monitor the achievement of children in 
school systems and guide decisions, such as students' 
eligibility for special resources or their 
qualification for admission to special school pro- 
grams. Children's scores on such tests are often 
aggregated to describe the performance of classrooms, 
schools, districts, or States. With technological 
advances, these tests have become more reliable 
and more precise, and their popularity has grown. 
Today they are a fixture in American schools, as 
common as books and classrooms; standardized test 
results have become a major force in shaping public 
attitudes about the quality of American schools and 
the capabilities of American students. 

Testing at a Crossroads 

Tests designed and administered outside the 
classroom are given less frequently than teacher- 
made tests, but they are thoroughly entrenched in the 
American school scene and their use has been on the 
rise. One indicator of growth is sales of commer- 
cially produced standardized tests. Revenues from 
sales of tests used in elementary and secondary 
schools more than doubled (in constant dollars) 
between 1960 and 1989 (see figure 1), a period 
during which student enrollments grew by only 15 
percent. 3 The rise in testing reflects a heightened 
demand from legislators at all levels — and their 
constituents — for evidence that education dollars 



1 For current data comparing primary and secondary school enrollment rates in the United States and other countries, see U.S. Department of 
Education, National Center for Education Statistics, Digest of Education Statistics, 1990 (Washington, DC: February 1991), p. 380; and George Madaus, 
Boston College, and Thomas Kellaghan, St. Patricks College, Dublin, "Student Examination Systems in the European Community: Lessons for the 
United States," OTA contractor report, June 1991. For a thorough analysis of completion and dropout [...] 
Center for Education Statistics, Dropout Rates in the US: 1989 (Washington, DC: September 1990). With respect to postsecondary education, as well, 
participation rates of American high school graduates are the highest in the world: close to 60 percent of persons of college-going age were enrolled in 
postsecondary institutions in [...] 
For details see Kenneth Redd and Wayne Riddle, Congressional Research Service, "Comparative Education: Statistics on Education in the U.S. and 
Selected Foreign Nations," 88-764 EPW, Nov. 14, 1988. 

2 Testing terms have both technical and common meanings, and often cause confusion. Box A is a glossary of words used in this report, and will help 
the reader understand the precise meanings of these words. 

3 U.S. Department of Education, Digest of Education Statistics, 1990, op. cit., footnote 1, p. 12. The fact that testing grew proportionally more rapidly 
than the student population suggests that policymakers may have responded to increased enrollments by attempting to institute greater administrative 
efficiency in the schools. As discussed in ch. 4, this is a familiar historical trend. 
Figure 1: Growth in Revenues From Test Sales and 
in Public School Enrollments, 1960-89 
Figure 2: Shifts in Federal, State, and Local 
Funding Patterns for Public Elementary and 
Secondary Schools, Selected Years 
[...] Simora (ed.), The Bowker Annual (New York, NY: Reed 
Publishing, 1970-1990). Enrollment data from U.S. Department 
of Education, National Center for Education Statistics, Digest 
of Education Statistics, 1990 (Washington, DC: February 1991), 
p. 12. 

are spent effectively. Holding schools and teachers 
"accountable" has increasingly become synonymous 
with increased standardized testing. 

State and local governments have traditionally 
assumed the greatest share of elementary and 
secondary education funding, as shown in figure 2. 
State funding began to exceed local funding as a 
percentage of the total starting in the mid-1970s, and 
State-mandated testing grew accordingly; 46 States 
had mandated testing programs in 1990 as compared 
to 29 in 1980. 4 Similarly, increases in Federal 
education spending during the 1960s and 1970s 
spurred increases in testing as Congress sought data 
to evaluate Federal programs and monitor national 
educational progress. The Federal Government currently 
spends over $20 billion per year on elementary 
and secondary education in programs administered 
by over a dozen Federal agencies. 5 



SOURCE: U.S. Department of Education, National Center for Education 
Statistics, Digest of Education Statistics, 1990 (Washington, 
DC: February 1991). 

Outcome-based measures of the effectiveness of 
educational programs (generally achievement test 
scores) have become key elements in the congressional 
appropriations and authorization process. 

Contradictory demands for revaluation of testing 
have been caught up in recent school reform 
initiatives. On the one hand, many teachers, admin- 
istrators, and others attempting to redesign curricula, 
reform instruction, and improve learning feel sty- 
mied by tests that do not accurately reflect new 
educational goals. On the other hand, most leading 
educational measurement experts emphasize that 
conventional standardized tests are useful tools in 
gauging the strengths, weaknesses, and progress of 
American students. 

Motivated in part by changing visions of class- 
room learning and by frustration with tests that many 
critics claim can hinder children's progress toward 
higher levels of achievement, many educators are 
turning to changed methods of testing. Some of these 
methods are modifications of conventional written 
tests; others are bolder innovations, requiring stu- 



4 OTA data on State testing practices, 1985 and 1991. 
5 U.S. Department of Education, Digest of Education Statistics, 1990, op. cit., footnote 1, p. 337. 



Box A: A Glossary of Testing Terminology 

A test score is an estimate. It is based on sampling what the test taker knows or can do. For example, by asking 
a sample of questions (drawn from [...] 
knowledge, skills, or behavior. Achievement tests are intended to estimate what a student knows and can do in a 
specific subject as a result of schooling. Achievement tests and aptitude tests are both instruments that [...] 
aspects of an individual's developed abilities; they exist on a continuum, with the former being more closely tied 
to specific curricula and school programs and the latter intended to capture knowledge acquired both in and out of 
school. 

Standardized tests are administered and scored under conditions [...] 
associate standardized tests with the multiple-choice format, it is important to emphasize that standardization is a 
generic concept that can apply to any testing format, from written essays to oral examinations to producing a 
portfolio. Standardization is needed to make test scores comparable and to assure as much as possible that test takers 
have equal chances to demonstrate what they know. 

The word standards applied to tests has at least two different meanings [...] 
goals, desirable behaviors, or models to which students, teachers, or schools should aspire. Such standards describe 
what optimal performance looks like and what is desirable [...] 
of Teachers of Mathematics has determined [...] 
as problem solving. The word standards, in its more technical meaning, denotes the specific levels of proficiency 
that students are expected to attain. Thus, setting a passing score for a test is equivalent to setting a standard of 
performance on that test. 

Because they are based on samples of behavior, tests are necessarily imprecise: scores can vary for reasons 
unrelated to the individual's actual achievement. Test scores can only describe what skills have been mastered, but 
they cannot, alone, explain why learning has occurred, or prescribe ways to improve it. The fact that achievement 
is affected by schools, parents, [...] 
schools and programs. Test scores must be interpreted carefully. 

Reliability refers to the consistency and generalizability of test data. Will a student's score today be close (if 
not identical) to her score tomorrow? Do the [...] 
of skills? If tests are scored by human judges, to what extent do different judges agree in their estimations of student 
achievement? A test needs to demonstrate a high degree of reliability before it is used to make decisions, particularly 
those with high stakes attached. 

Validity refers to whether or not a test measures what it is supposed to measure, and whether appropriate 
inferences can be drawn from test results. Validity is judged from many types of evidence, including, in the views 
of some experts, the consequences of translating test-based inferences into decisions or policies that can affect 
individuals or institutions. An acceptable [...] 

There are two basic ways of interpreting student performance on tests. One is to describe a student's test 
performance as it compares to that of other students (e.g., he typed better than 90 percent of his classmates). 
Norm-referenced tests are designed to make this type of comparison. The other method is to describe the skills or 
performance that the student demonstrates [...] 
Criterion-referenced tests are designed to compare a student's test performance to clearly defined learning tasks or skill levels. 
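
The two interpretations described above can be contrasted in a short sketch. The Python fragment below is illustrative only and is not drawn from the OTA report: the function names, the peer scores, and the cutoff of 75 are hypothetical, and the second reading is the one commonly called criterion-referenced. It reports the same raw score once relative to other students and once against a fixed proficiency level.

```python
from bisect import bisect_left


def percentile_rank(score, peer_scores):
    """Norm-referenced reading: percent of peers scoring strictly below `score`."""
    ordered = sorted(peer_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


def meets_criterion(score, cutoff):
    """Criterion-referenced reading: has the student reached a fixed level?"""
    return score >= cutoff


# Made-up scores for ten classmates and one student (assumed data).
peers = [35, 42, 48, 50, 55, 58, 61, 64, 70, 72]
student_score = 71

print(percentile_rank(student_score, peers))   # 9 of 10 peers score lower: 90.0
print(meets_criterion(student_score, 75))      # 71 is below the cutoff: False
```

In this made-up case the student outscores 90 percent of classmates yet falls short of the fixed standard, which is why the two kinds of test answer different questions.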

Performance assessment refers to testing methods that require students to create an answer or product that 
demonstrates their knowledge or skills. Performance assessment can take many different forms including writing 
short answers, doing mathematical computations, writing an extended essay, conducting an experiment, presenting 
an oral argument, or assembling a portfolio of representative work. 

Constructed-response items are one kind of performance assessment consisting of open-ended [...] 
on a conventional test. However, they require students to produce the solution to a question rather than to [...] 
an array of possible answers (as multiple-choice items do). 

Computer-administered testing is a generic term covering any test that is taken by a student [...] 
computer. A special type of computer-administered testing is computer-adaptive testing, which [...] 
computer's memory and branching capabilities [...] 
test taker as the test is taken. 

SOURCE: Office of Technology Assessment, 1992. 





Testing in American Schools: Asking the Right Questions 




Photo credit: Bob Daemmrich 

[...] standardized achievement tests several times during elementary and secondary school. 
Standardized test results have become a major force in shaping public attitudes about the quality of American schools 
and the capabilities of American students. 



dents to demonstrate their knowledge and skills 
through methods known as "performance assess- 
ment." Computer technologies, video, and inte- 
grated multimedia systems add capabilities and 
richness not usually attainable from conventional 
tests, and are gaining ground in assessment as well 
as instruction. 

These new approaches to testing have been fueled 
by some cognitive scientists who claim that complex 
thinking involves processes not easily reduced to the 
routinized tasks required on conventional tests. A 



recent report on science education, for example, 
argued that: 

Rather than mastering concepts, students believe that 
recognizing terms in a multiple-choice format is the 
appropriate educational goal. In the long run the 
impact of current modes of testing on enduring skills 
and strategies for learning will be inimical to re- 
form. 6 

In contrast, many testing professionals maintain 
that school improvement efforts must be constructed 
on a solid foundation of information about what 



6 National Research Council, Fulfilling the Promise: Biology Education in the Nation's Schools (Washington, DC: 1990), p. 44. Another recent report 
concluded that: ". . . to direct testing along a more constructive course, we must draw on richer direct evidence of knowledge and skill from information 
sources beyond multiple-choice tests." See National Commission on Testing and Public Policy, From Gatekeeper to Gateway: Transforming Testing [...] 
[...] Whys, Whats, and Whithers," Phi Delta Kappan, vol. 70, No. 9, May 1989. 
students are learning; well-designed tests, they say, 
if used and interpreted properly, can provide invalu- 
able information in a reliable, consistent, and 
efficient fashion. For example, standardized tests 
can inform policymakers by supplying trend data on 
the skill levels of American students. Recent analy- 
sis of data from the Iowa Tests of Basic Skills 
revealed that student performance improved be- 
tween 1979 and 1985, even on test items designed to 
assess certain higher order skills, contradicting 
findings from other test data that improvements were 
limited to mechanical tasks. 7 

Measurement experts contend that these standard- 
ized tests are also useful to teachers, as tools to 
calibrate classroom impressions of student progress; 
they are viewed as one relatively efficient, albeit 
inexact indicator of how a given child or school 
system is progressing relative to students nation- 
wide. One test author expressed a view shared by 
many others in the testing community: 

. . . comprehensive, survey-type standardized achievement 
tests have served a useful function in monitoring 
the achievement levels of individual pupils and 
the aggregate groupings of these students in terms of 
classrooms, buildings, and the district. . . . 8 

Common Ground 

To outsiders listening in on this debate, it may 
appear that proponents of conventional and new 
forms of assessment are adversaries locked in an 
intractable stalemate. Closer inspection, however, 
reveals that testing policy is not a zero-sum game in 
which either existing testing or new methods win, 
but an arena with multiple and mutually compatible 
choices. 

The trick is using the kind of test that is best 
suited to providing the desired type of informa- 
tion. Thus, although some activists in the debate 
have carved extreme positions, most others agree 
on at least these two fundamental points: 

• different forms of testing can, if used cor- 
rectly, enrich our understanding of student 
achievement; and 



• tests of any kind should be used only to serve 
the functions for which they were designed 
and validated. 

On this common ground it may be possible to 
build genuine reform. One prominent psychologist 
and long-time participant in the politics and science 
of testing, commenting on what appears to be a rare 
opportunity, observed that: "... our testing ecology 
is entirely manmade; what we made we can 
change." 9 

Lessons of History 

But history tempers the optimism. Since the birth 
of mass public education in America some 150 years 
ago, innovation in tests and testing has been most 
attractive during periods of heightened public anxi- 
ety about the state of the schools. During these 
periods, however, legislators and school officials 
feel the greatest pressure to act, and are most prone 
to rely on existing tests as levers of policy. Thus, 
researchers and policymakers involved in the pains- 
taking process of curricular reform and new test 
design often find themselves at odds with those who 
demand quicker and more immediately noticeable 
action. Hence (as described in detail in chapter 4 of 
the full report), tests have too often been used to 
serve functions for which they were not designed or 
adequately validated. Within the education policy 
and research community, therefore, there is an 
undercurrent of concern that new tests will, as in the 
past, be implemented before they have been vali- 
dated and before their effects on learning can be 
understood. 

For some educators the principal concern is that 
new tests will raise new barriers— to women, people 
of color, other minorities, and the economically 
disadvantaged. On these issues, too, caution flags 
are up: precisely because testing has historically 
been viewed as a means to achieve educational 
equity, tests themselves have always been scruti- 
nized on the question of whether they do more to 
alleviate or exacerbate social, economic, and educa- 
tional disparities (see box B). 



7 See Elizabeth Witt, Myunghee Han, and H.D. Hoover, "Recent Trends in Achievement Test Scores: Which Students Are Improving and on What 
Levels of Skill Complexity?" paper presented at the annual meeting of the National Council on Measurement in Education, Boston, MA, 1990. See also 
Robert Linn and Stephen Dunbar, "The Nation's Report Card Goes Home: Good News and Bad About Trends in Achievement," Phi Delta Kappan, 
vol. 72, No. 2, October 1990, p. 132. For a thorough analysis of trends in achievement that illustrates the importance of using multiple measures of 
performance, see Daniel Koretz, Trends in Educational Achievement (Washington, DC: Congressional Budget Office, 1986). 

8 Herbert Rudman, "The Future of Testing is Now," Educational Measurement: Issues and Practice, vol. 6, No. 3, fall 1987, p. 6. 

9 Sheldon White, professor of psychology, Harvard University, personal communication, June 1991. 
Box B: Equity, Fairness, and Educational Testing 
The Purpose of This Report 

Federal policymakers are caught in an unenviable 
dilemma. On the one hand they must satisfy the 
growing demand for accountability, which is often 
expressed in terms of simple questions: Do the 
schools work? Are students learning? On the other 
hand, they must also be responsive to growing 
disaffection with the quality of data on which 



administrators rely for evaluations of programs: 
achievement scores are rough indicators, at best, of 
progress in attaining the many goals of federally 
funded programs. Not surprisingly, Federal evalua- 
tion requirements that place additional testing bur- 
dens on grantees and program participants often spur 
an interest in revising those very requirements. 10 As 
the Federal Government has become a more promi- 
nent player in elementary and secondary education, 



10 For example, the Department of Education recently formed a task force to look into problems of testing and evaluation for the Chapter 1/Title I 
compensatory education program. See ch. 3 of this report. 
and as the public's attitudes toward concepts of 
national educational goals and standards have evolved, 
Congress has become more involved in the testing 
debate. 11 

Congress has a stake in U.S. testing policy for 
three main reasons: 



• to ensure that accurate and reliable data about 
American educational achievement are pro- 
vided to lawmakers, program administrators, 
parents, teachers, test takers, and the general 
public; 

• to ensure that the tests used to evaluate Federal 
education programs do not, in themselves, 



11 A 1989 Gallup poll found that the majority of respondents supported the idea of national achievement standards and goals, but few supported either 
State or Federal intervention in the definition of those standards and goals. For discussion see George Madaus, Boston College, and Thomas Kellaghan, 
St. Patricks College, Dublin, "Examination Systems in the European Community: Implications for a National Examination System in the United States," 
OTA contractor report, April 1991. 

impede progress toward program goals; and 

• to ensure that tests are used fairly and do not 
infringe on individual rights or impose unac- 
ceptable social costs. 

Congress faces a variety of decisions that could 
have significant and long-term effects on the scope, 
quantity, and quality of testing in the United States. 
Issues related to national testing and the role of tests 
in Federal education programs are already on the 
congressional agenda; issues regarding the rights of 
test takers may emerge, as they have in previous 
times, if new national and State tests are mandated 
or if the stakes attached to existing tests are raised. 

This report is aimed at helping Congress: 

• better understand the functions, history, capa- 
bilities, limitations, uses, and misuses of educa- 
tional tests; 

• learn more about the promises and pitfalls of 
new assessment methods and technologies; and 

• identify and weigh policy options affecting 
educational testing. 

To unravel the complexities of these topics, OTA 
examined technological and institutional aspects of 
educational testing. This summary and policy chap- 
ter synthesizes OTA's findings on tests and testing, 
and outlines options for congressional action. In the 
full report, chapter 2 examines recent changes in the 
uses of testing as an instrument of policy, chapter 3 
covers current issues affecting the role of the Federal 
Government in educational testing, chapter 4 re- 
views the history of testing in the United States, and 
chapter 5 considers lessons from testing in selected 
European and Asian countries. The final three 
chapters focus on the tests themselves. Chapter 6 
explains characteristics and purposes of existing 
educational tests, and examines the reasons new test 
designs seem warranted. Chapter 7 explores various 
approaches to performance assessment and how these 
methods are being implemented in schools, and chapter 
8 examines the current and future roles of computers 
and other information technologies in assessment. 

In this report, the analysis and discussion are 
framed in terms of the functions of testing. OTA 
concludes that examining the capability of various 
tests to meet specific objectives is the necessary first 
step in abating the seemingly endless controversy 



over the quantity and format of testing in American 
schools, and in laying the groundwork for new 
approaches. 

The Functions of Testing 

Educational tests have traditionally served many 
purposes that can be grouped into three basic 
functions: 

• to aid teachers and students in the conduct of 
classroom learning; 

• to monitor systemwide educational outcomes; 
and 

• to inform decisions about the selection, place- 
ment, and credentialing of individual students. 

These three functions have a common feature: 
they provide information to support decisionmak- 
ing. However, they differ in the kinds of information 
they seek and the types of decisions they can 
support, and test results appropriate for some deci- 
sions may be inappropriate for others. 

Classroom Feedback for Students 
and Teachers 

Teachers must constantly adapt to the behaviors, 
learning styles, and progress of the students in their 
classrooms. 12 Tests can help them organize and 
process the steady stream of data arising from 
classroom interactions. Just as physicians use body 
temperature, blood pressure, heart rate, x rays, and 
other data to form an image of the patient's health 
and to determine appropriate treatments, teachers 
can use data of various types to better manage their 
classes and, in some circumstances, to tailor lessons 
to the specific needs of individual students. Students 
can use information to gain sharper understanding of 
their strengths and weaknesses in different subjects 
and can adjust their study time accordingly. 

Tests that can aid classroom instruction and 
learning need to: 

• provide detailed information about specific 
skills, rather than global or general scores; 

• be linked to content that is taught in the 
classroom; 

• be administered frequently; 

• give feedback to students and teachers as 
quickly as possible; 




12 For a recent analysis of the internal workings of classrooms and implications for education policy, see Edward Pauly, The Classroom Crucible: What 
Really Works, What Doesn't, and Why (New York, NY: Basic Books, 1991), especially ch. 4. 

A student in 1943 takes her oral spelling examination after 
completing a written examination on the blackboard. 

Teachers have always used a variety of tests to help them 
manage their classes and evaluate student progress. 

• be scored or graded to help students learn from 
their errors and misunderstandings, and help 
teachers intervene when students get stuck; and 

• be based on clear and open criteria for scoring 
so that students know what to study and how 
they are being evaluated. 

System Monitoring 

How well is a school or school system perform- 
ing? This is a question often posed from the outside, 
by parents, legislators, and others with particularly 
high stakes in the answer. As shown in chapters 2 
and 4 of the full report, the question is usually posed 
with more urgency when the impression is that the 
answer will be "not very well." 

Educational tests of various sorts have long been 
viewed as objective instruments capable of provid- 



ing systematic and informed answers about the 
learning that takes place in schools. In an educa- 
tional system as decentralized and diverse as the 
American one, there is a nearly insatiable appetite 
for evidence that all schools are providing children 
with a decent education. Since the mid- 19th century, 
tests have been used to determine how much 
students in different schools or school districts were 
learning. Recent increases in Federal expenditures 
have stimulated new demands for system accounta- 
bility. 

Test scores alone cannot reveal how or why 
learning has occurred, or the degree to which 
schools, parents, the child's home background, or 
other factors have affected learning. When com- 
bined appropriately with other data, however, such 
as prior test results and children's socioeconomic 
status, test results can help explain — as well as 
describe— the outcomes of schooling. 13 

For tests to yield meaningful comparisons across 
schools and districts, they must: 

• be uniformly and impartially administered and 
scored; and 

• meet reasonable standards of consistency, fairness, 
and validity. 

In addition, to be useful system monitoring tools, 
these tests: 

• should provide general information about 
achievement, rather than detailed information 
on specific skills; 

• should describe the performance of groups of 
students— classrooms, schools, districts, or 
States— rather than individuals (thereby allow- 
ing the use of sampling methods that yield the 
desired information without the costly testing 
of every student); and 

• can be administered infrequently (once or twice 
a year at the most). 
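
The sampling point in the list above can be illustrated with a minimal sketch, offered under stated assumptions rather than as a description of any actual testing program. The population of scores, the sample size, and the function name estimate_group_mean are hypothetical; the code draws a simple random sample and reports the sample mean with its standard error, ignoring refinements such as stratification or a finite-population correction.

```python
import random
import statistics


def estimate_group_mean(scores, sample_size, seed=0):
    """Estimate a group's mean score from a simple random sample.

    Returns (sample_mean, standard_error). Hypothetical helper for
    illustration; real programs use more elaborate sampling designs.
    """
    rng = random.Random(seed)                  # fixed seed for repeatability
    sample = rng.sample(scores, sample_size)   # sampling without replacement
    mean = statistics.mean(sample)
    std_error = statistics.stdev(sample) / sample_size ** 0.5
    return mean, std_error


# Made-up district of 1,000 students with scores between 40 and 60.
district_scores = [40 + (i * 8) % 21 for i in range(1000)]

estimate, std_error = estimate_group_mean(district_scores, sample_size=100)
print(f"estimated mean: {estimate:.1f} (standard error {std_error:.2f})")
print(f"true mean:      {statistics.mean(district_scores):.1f}")
```

Testing one student in ten here yields a group-level estimate with a quantified margin of error, but no usable score for any individual, which is consistent with the distinction drawn between system monitoring and decisions about individual students.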

Selection, Placement, Credentialing 14 

Tests designed to provide data about individual 
students' current achievement or predicted perform- 



13 For example, recent analysis of data from close to 1,000 school districts in Texas found significant differences in student achievement scores that 
could be explained by variations in measures of teacher quality and other inputs. See Ronald Ferguson, "Paying for Public Education: New Evidence 
on How and Why Money Matters," Harvard Journal on Legislation, vol. 28, No. 2, summer 1991, pp. 465-498; and Richard Murnane, "Interpreting 
the Evidence on 'Does Money Matter?'" Harvard Journal on Legislation, vol. 28, No. 2, summer 1991, pp. 457-464. 

14 These three terms overlap. However, selection refers primarily to decisions about a student's qualifications for admission to schools; placement refers 
to decisions about qualifications of students to participate in programs within schools they attend; and credentialing (or certification) refers to decisions 
regarding proficiencies reached by students who have participated in programs or completed courses of study. 
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ance can be used for individual selection, placement, 
or credentialing decisions. This function of testing 
has a long historical tradition: the earliest recorded 
examples are Chinese civil service qualifying tests 
given in the 2nd century B.C. As discussed in greater 
detail in chapter 5 of the full report, many European 
and Asian countries continue to use examinations 
primarily for professional and educational "gate- 
keeping" functions, such as certifying students as 
qualified to attend specialized or elite public educa- 
tion programs. 

Placement and certification decisions are still 
quite commonly based on tests, even in elementary 
and secondary education. Minimum competency 
examinations are required in many States for high 
school graduation, for promotion from one grade to 
the next, or for placement in remedial or gifted 
programs; 15 Advanced Placement examinations are 
used to determine whether high school students will 
be given college credit and placed in advanced 
courses when they arrive at college; and the National 
Teacher's Examination is necessary for teacher
licensing in 35 States. 

In the United States, however, the use of tests for 
selective admissions decisions has been more lim- 
ited than in most other countries. 16 It is rather at the 
end of high school, when students compete for 
admission to colleges and universities, that selection 
tests play a critical role. 17 

Some recent proposals to initiate new tests at the 
national level include provisions for placement and 
certification. One such proposal calls for a "certificate
of initial mastery," to be issued to graduating
high school students who perform at prescribed
levels on the test, and for examinations as certification
criteria for completion of fourth and eighth
grades. 18

In contrast with tests used for system monitoring, 
tests used for selection, placement, or certification 
decisions must: 

• provide individual student scores; 

• meet particularly high standards of comparabil- 
ity, consistency, fairness, and validity; 

• provide information that is demonstrably rele- 
vant to successful performance in future school 
or work situations (in the case of selection 
tests); and 

• provide information that is demonstrably rele- 
vant to the identification of children with 
special needs (in the case of placement tests 
used for gifted and talented programs, remedial 
education, or other special K-12 situations). 

These tests are similar to system monitoring tests 
with respect to the need for impartial scoring, 
standardized administration, generality of information,
and frequency of testing.

Some proposals for a new national test or system 
of examinations have selection or certification as a 
principal function. Good tests for these purposes 
must undergo intensive and time-consuming development
as well as careful empirical evaluation. They
must be carefully and clearly validated for these 
intended purposes. Historically, tests used for these 
purposes have been the most subject to legal 
challenges and scrutiny (see chs. 2 and 4 in the full 
report).

15 There is widespread concern about tests being used as the principal basis for placement of children into special programs, such as "gifted and
talented" or remedial. "A major problem is getting students who obviously need it into either gifted or remedial programs when they do not meet the
'required' minimum or maximum score on the tests [to qualify for State funding]," said Jack Webber, a sixth grade teacher in Redmond, WA (personal
communication, September 1991). Precise data on the numbers of schools or districts that rely on tests for these purposes, and on exactly how test data
enter into those decisions, are difficult to find. Recently the New York State Commissioner of Education struck down the use of achievement tests as
the sole screening criteria for placement of students in "enriched" programs. See also discussion in ch. 2.

16 The situation has changed since the turn of the century, when, e.g., "... a student could not be admitted to Central [High School] without
demonstrating academic competence on an entrance exam, ..." See David Labaree, The Making of an American High School: The Credentials Market
and the Central High School of Philadelphia, 1838-1939 (New Haven, CT: Yale University Press, 1988), p. 50. This was not a phenomenon limited
to the East Coast: rural students in Michigan and elsewhere in the Midwest needed to pass entrance examinations to gain admission into urban high
schools. Since that time, however, policies of selective admissions into public high schools have disappeared in all but a handful of special institutions,
such as the Bronx High School of Science in New York.

17 Over 3,000 colleges and universities use the Scholastic Aptitude Test (SAT) or American College Test (ACT) to aid in their selection from vast
numbers of applicants, and recruits take the Armed Services Vocational Aptitude Battery (ASVAB) for placement within the military. Many private
elementary and secondary schools use tests as a criterion for admission.

18 For a summary of national testing proposals as of early 1991, see James Stedman, Congressional Research Service, "Selected National Organizations
Concerned With Educational Testing Policy," memorandum, Feb. 8, 1991. For a more recent update and discussion of the central issues, see "National
Testing: An Overview," Youth Policy, vol. 13, Nos. 4-5, special issue, September 1991, pp. 29-35. For a critique of these proposals see also Madaus
and Kellaghan, op. cit., footnote 11.

The United States ranks high in the world in terms of the percentage of the population graduating from high school. These students
were photographed during their 1991 graduation ceremony at Woodrow Wilson High School, a large public high school in the
District of Columbia. During the 1970s and 1980s many States instituted minimum competency testing
as a criterion for graduation.



Raising the Stakes 

In theory, educational tests are unobtrusive instru- 
ments of estimation. A major sticking point in any 
discussion of testing, however, is whether, in 
practice, testing affects the behavior it is intended to 
measure. In the current debate, advocates of new 
ways to test often argue that since tests can play a 
powerful role in influencing learning, they must be 
designed to support desired educational goals. These 
advocates disparage "teaching to the test" when a 
test calls for isolated facts from a multiple-choice 
format, but endorse the concept when the test 
consists of "authentic" tasks. For these educators,
one of the main criteria for a "good" test is whether
it consists of tasks that students should practice. 

More traditional measurement theorists, on the 
other hand, are skeptical about the value of teaching 
to the test because of the need to obtain valid and 
reliable information about the whole domain of 
knowledge, not just the sample of tasks that appears 
on the test. Thus, they argue that, regardless of a 
test's format, test scores are meaningless if students 
have practiced the tasks. 

The core of the often shrill debate reflects 
positions on two central questions: 

• Do conventional standardized tests designed to
estimate student achievement negatively influ- 
ence instruction and learning? 

• Do new testing methods designed to guide 
instruction and learning accurately estimate 
student achievement? 

Tests and Consequences 

As the Nation's use of standardized tests has 
increased, the consequences attached to test results 
have become more serious. All but four States have 
standardized testing programs. Test scores are applied
to a wide array of decisions affecting individual
children, schools, and school systems. Students
who have taken college entrance examinations, high 
school juniors who have failed State minimum 
competency tests, schools that have become lures in 
real estate advertisements, and States that have 
found themselves ranked in the national media by 
their average test scores are likely to remember the 
event — and its consequences — long afterwards. 

Many educators, extrapolating from their experi- 
ences in classrooms as students or as teachers, 
contend that tests influence students and teachers 
only if they perceive that important consequences
are linked to test results. 19 But a fundamental
problem arises when important consequences, or 
high stakes, are attached to test results; and not 
surprisingly, the increase in high-stakes testing over 
the past two decades has brought a concomitant rise 
in controversy. To understand the problems that can
arise from high-stakes testing it is useful to consider 
a familiar medical metaphor. 

Fever thermometers are used to measure body 
temperature without influencing that temperature; 
they provide information that could lead to treatment 
of the underlying conditions suspected of causing 
the fever. Similarly, well-designed educational tests 
can provide useful information to help students, 
teachers, or even school systems. Teachers can use
tests to gauge their students' progress and decide
how to "treat" children who are not doing well;
students (in the upper grades especially) can review 
their test results to see whether they are learning the 
material and to determine how they might learn it 
more effectively; and State funding authorities can 
use information on the relative progress of students 
in different schools to develop responsive educa- 
tional strategies. Thus, the information from tests 
can be used to choose appropriate educational 
"treatments." 

Suppose, however, that patients were punished for 
running a high fever (or rewarded for a low one), or 
that doctors were rewarded for bringing down their 
patients' fever (or penalized if the fever remained 
high). They could easily take actions (cold showers,
aspirin, a glass of cold beer) to "cure" the
symptom but not necessarily the underlying illness. 
More comprehensive and appropriate treatment 
could be delayed or skipped. Just as temporary drops 
in body temperature could give misleading indica- 
tions of changes in health status, fluctuations in 
scores from high-stakes educational tests may not 
reflect genuine changes in achievement. When 
stakes are high, a heavy emphasis is sometimes
placed on specific test results, and especially on
increasing scores. The symptom (low test scores)
is treated without affecting the underlying condition
(low achievement).

An instructive lesson about the mixed effects of 
high-stakes testing comes from the minimum com- 
petency testing (MCT) movement of the 1970s and 
1980s (see box C). As described in greater detail
in chapter 2 of the full report, many State legislatures
pegged promotion, placement, and graduation re- 
quirements to performance on criterion-referenced 
tests. The underlying rationale was that extrinsic 
rewards and sanctions would induce students to 
learn the relevant material more diligently and 
heighten teachers' motivation to ensure that all 
students learned the basics before moving them 
ahead. It now appears that the use of these tests 
misled policymakers and the public about the 
progress of students, and in many places hindered 
the implementation of genuine school reforms. 

More recent research seems to confirm that
high-stakes testing can mislead policymakers. 20 
Complicating this picture, however, is other prelimi- 
nary research evidence suggesting that students may 
underperform on tests that bear no individual 
consequences at all. 21 If such distortions are occur- 
ring, they may be misleading policymakers and the 
general public into believing the schools are in 
worse shape than they really are (and into blaming 
the school system for a long list of social and 
economic problems 22 ). The fine-tuning knob that 
could adjust tests to provide just the right degree of 
incentive to students, enough to elicit their best
genuine performance, has not been invented.

Test Use 

One of the most vexing problems in testing policy
is how to prevent test misuse, principally the
application of a test to purposes for which it was not
designed. 23 A familiar case of test misuse is the
ranking of State school systems on a "wall chart"
displaying average scores on the Scholastic Aptitude
Test (SAT) along with other data. 24 Why was this a
case of test misuse? First, the SAT is designed to
rank applicants from diverse educational backgrounds
with respect to their likely individual
performance as college freshmen. It is designed
specifically to override differences in curricula,

19 See, for example, Lauren Resnick, professor, University of Pittsburgh, testimony before the U.S. Congress, Senate Committee on Labor and Human
Resources, Subcommittee on Education, Arts, and Humanities, Mar. 7, 1990.

20 See, e.g., Daniel Koretz, Robert Linn, Stephen Dunbar, and Lorrie Shepard, "The Effects of High Stakes Testing on Achievement: Preliminary
Findings About Generalization Across Tests," paper presented at the annual meeting of the American Educational Research Association, Chicago, IL,
April 1991; and Thomas Haladyna, Susan B. Nolan, and Nancy S. Hass, "Raising Standardized Achievement Test Scores and the Origins of Test Score
Pollution," Educational Researcher, vol. 10, No. 5, June-July 1991.

21 See, e.g., Steven Brown and Herbert Walberg, University of Illinois at Chicago, "Motivational Effects on Test Scores of Elementary School
Students," monograph, n.d.; and Paul Burke, "You Can Lead Adolescents to a Test But You Can't Make Them Try," OTA contractor report, Aug.
14, 1991.

22 See, e.g., Clark Kerr, "Is Education Really All That Guilty?" Education Week, vol. 10, No. 3, Feb. 27, 1991, p. 30.



23 See also Burke, op. cit., footnote 21; Larry Cuban, "The Misuse of Tests in Education," OTA contractor report, Sept. 9, 1991; Robert L. Linn, "Test
Misuse: Why is it so Prevalent," OTA contractor report, September 1991; and Nelson L. Noggle, "The Misuses of Educational Achievement Tests for
Grades K-12: A Perspective," OTA contractor report, October 1991.

24 The wall chart, now defunct, was initiated in 1984 by then Secretary of Education Terrell Bell.

Box C: The Minimum Competency Debate (Continued)

As with every other surge of testing in American education history, 8 MCT was quickly shrouded in
controversy. Educators and measurement specialists warned against the quick-fix mentality that exit tests could
solve the problems stemming from a [...] to
be the latest obstacle to the educational [...]

What have been the effects of MCT? There is a [...]
MCT influenced education, but disagreement over whether it influenced education for [...]

Challenged to show that MCT worked, its supporters like to point to trends in achievement test scores: the
apparent improvement in literacy and numeracy among students generally [...]
and minority students, and the upturn in Scholastic Aptitude Test (SAT) scores that began in 1979. Although MCT
had its most direct effects on high school [...]
lower grades too, where students heard the message that they would need to work harder in order to be promoted
and eventually graduate. Thus, they credit MCT even with the upturn in standardized test scores in the elementary [...]

Other analysts dismiss these conclusions. First, test scores went up even in States without MCT programs,
undermining the causal relation between MCT and achievement. 9 Second, even in States with MCT where scores
did go up, the timing of these events raises important questions. A 1987 [...]
of the increase in competency testing occurred . . . several years after the upturn in achievement first became
apparent in the lower grades." 10 The report showed that achievement scores probably began to climb beginning with
fifth graders in 1975. Thus, unless one is willing to believe that tests can have virtually instantaneous effects on
achievement, the timing of the rise in scores cannot be attributed to [...]
in 1979 reflects the general improvement in performance recorded by that cohort of test takers all through their
school years, and not the advent of MCT. As one analyst put it: "... the higher scores rolled through the grades
like a rippling wave as the elementary schoolchildren got older." 11

Finally, what about the observed improvements in National Assessment of Educational Progress (NAEP)
scores? First, NAEP scores did rise in the 1970s and 1980s, but the rise actually began as early as the 1974
assessment, well before MCT was in operation in all but one or two States. Second, analysts point out that while
test performance among Black and Hispanic 17-year-olds improved markedly during the 1970s and 1980s, it would
be misleading to infer that the gap between white and Black students had disappeared: ". . . white students
constituted the great majority of students in the two highest categories [suggesting] that there is still a substantial

8 See ch. 4.

9 See Gerald Bracey, rejoinder to Barbara Lerner, Commentary, vol. 92, No. 2, August 1991, p. 10.

10 Daniel Koretz, Educational Achievement: Explanations and Implications of Recent Trends (Washington, DC: Congressional
Budget Office, August 1987), p. 84.

11 Bracey, op. cit., footnote 9.



instruction, and academic rigor that may exist in the 
thousands of high schools from which applicants 
have graduated; by design, therefore, it does not 
measure a student's mastery of any given curricu- 
lum, and therefore should not be used to gauge a 
school's effectiveness at delivering its curriculum. 
Second, the SAT is taken only by about one-third of 
all students nationwide (with considerable regional
variation), so it provides a very inadequate measure
of the quality of education offered to all the students 
in a State. 25 

There is considerable professional agreement 
about a number of principles of good test develop- 
ment and appropriate test use. The primary vehicle 
for enforcing these principles is self-regulation by 



25 For discussion of these and other problems in using the Scholastic Aptitude Test as an indicator of State educational programs, see Cuban, op. cit.,
footnote 23; and Harold Hodgkinson, "Schools are Awful— Aren't They?" Education Week, vol. 11, No. 9, Oct. 30, 1991, p. A32.

For example, one study combined analysis of survey data and intensive interviews with [...]
administrators, and concluded that the testing reinforced an [...]
efforts to upgrade the content of [...] being delivered to all students. 14 Other studies have bemoaned the
narrowing effect that MCT seems to have had on instructional strategies, content coverage, and course offerings. 15
Still other studies focus on the potentially misleading [...]
suggests that improvements on high-stakes tests do not generalize well to other measures of achievement [...]
domain; 16 and studies that focus in particular on teachers in districts with high-[...]
minimum competency tests, school evaluation tests, or externally developed course-end tests, demonstrate a
greater influence of testing on curriculum and instruction. 17

In the end, then, there appears to be consensus that innovation in school testing policies can have profound
effects; the disagreement is over the desirability of those effects. Although [...]
at times even confusing, one thing is clear: test-based accountability is no panacea. Specific proposals for tests
intended to catalyze school improvement must be scrutinized on their individual merits.



[...] Phi Delta Kappan, vol. 72, No. 2, October 1990, p. 130. For discussion of [...], see also John Carroll, "The National Assessments
in Reading: Are We Misreading the Findings?" Phi Delta Kappan, vol. 68, No. 6, February 1987, pp. 424-430.

[...] 1990.

test developers and other trained professionals. 26
Standards and codes developed by professional
associations, critical reviews of tests, and individual
professional codes of ethics all contribute to better
testing. But, in general, few safeguards exist to
prevent misuse and misinterpretation of scores,
especially once they reach the public domain. Many
professionals in the testing community also believe 
the codes lack enforcement mechanisms. Moreover, 
there has recently been heightened concern among 
test authors and publishers that market forces may 
interfere with good testing practice. As one test 



26 An example of self-regulation often cited in the testing community is a decision taken by the Educational Testing Service (ETS) concerning the
National Teachers Examination (NTE), which is designed to certify new teachers. When the Governor of Arkansas signed a bill in 1983 requiring teachers
to pass the test in order to keep their jobs, ETS President Gregory Anrig protested: "It is morally and educationally wrong to tell someone who has been
judged a satisfactory teacher for many years that passing a certain test on a certain day is necessary to keep his or her job." ETS announced it would
no longer sell the NTE to States or school boards that used it to determine the futures of practicing teachers. See Edward Fiske, "Test Misuse is Charged,"
The New York Times, Nov. 29, 1983, p. C1; also David Owen, None of the Above (Boston, MA: Houghton Mifflin, 1985), pp. 243-260.

author has warned: ". . . new corporate managers
. . . [are] rushing to produce tests that will ostensibly
meet purposes for which the tests have never been
intended." 27

New Testing Technologies 

Educators dedicated to the proposition that testing 
can be an integral part of instruction and a tool for 
assessing the full range of knowledge and skills have 
given impetus to new efforts to expand the technologies,
modes, formats, and content of testing. Test
developers and educators are experimenting with:

• performance assessment, a broad category of 
testing methods that require students to create 
answers or products that demonstrate what they 
are learning, and 

• computer and video technologies for develop- 
ing test items, administering tests, and structur- 
ing whole new modes of content and format. 

This section of the summary begins with an 
overview of the characteristics of these new ap- 
proaches to assessment, and then considers their 
potential role in advancing the three basic functions 
of testing. It is important to remember that:

• new assessment methods alone cannot ensure 
consensus on what children should learn or the
levels of skills children should acquire, 

• curriculum goals and standards of student 
achievement need to be determined before 
appropriate assessment methods can be de- 
signed, and 

• new assessment methods alone do not necessar- 
ily equip teachers with the skills necessary to 
change instruction and achieve new curricular
goals. 

Performance Assessment 

The move toward new methods of student testing 
has been motivated by new understandings of how 
children learn as well as by changing views of 
curriculum. These views of learning, which chal- 
lenge traditional concepts of curricula and teaching, 
also challenge existing methods of evaluating stu- 
dent competence. For example, it is argued that if 
instruction ought to be individualized, adaptive, and 
interactive, then assessment should share these 
characteristics. In general, educators who advocate
performance assessment believe testing can be made
an integral and effective part of learning.

Performance assessment covers a broad range of
testing methods that require students to create answers or
products to demonstrate what they are learning. In this art
assessment, students record their observations as they
sculpt with clay; the finished product and their notes will
become part of their portfolio for the year.

One type of performance assessment uses paper- 
and-pencil methods such as "constructed-response" 
items, for which students produce their own answers 
rather than select from a set of choices. Other 
approaches take performance assessment further 
along the continuum, from short answers at one
extreme to live demonstrations of student work at
the other (see box D). Under ideal circumstances, 
these methods share the following characteristics: 

• they require students to construct responses, 
rather than select from a set of answers; 

• they assess behaviors of interest as directly as 
possible; 

• they are in some cases aimed at assessing group 
performance rather than individual perform- 
ance; 

• they are criterion-referenced, meaning they 
provide a basis for evaluating a student's work 
with reference to criteria for excellence rather 
than with reference to other students' work;

• in general, they focus on the process of problem 
solving rather than just on the end result; 

• carefully trained teachers or other qualified 
judges are involved in most of the evaluation 
and scoring; and 



27 Rudman, op. cit., footnote 8, p. 6.

Box D: The Many Faces of Performance Assessment

Performance assessment is a broad term. It covers many different types of testing methods that require students
to demonstrate their competencies or knowledge by creating an answer or product. It is best understood as a 
continuum of formats that range from the simplest student-constructed responses to comprehensive demonstrations 
or collections of large bodies of work over time. This box describes some common forms of performance 
assessment.

Constructed-response questions require students to produce an answer to a question rather than to select from
an array of possible answers (as multiple-choice items do). In constructed-response items, questions may have just 
one correct answer or may be more open ended, allowing a range of responses. The form can also vary: examples 
include answers supplied by filling in a blank; solving a mathematics problem; writing short answers; completing 
figural responses (drawing on a figure like a graph, illustration, or diagram); or writing out all the steps in a geometry
proof. 

Essays have long been used to assess a student's understanding of a subject by having the student write a 
description, analysis, explanation, or summary in one or more paragraphs. Essays are used to demonstrate how well 
a student can use facts in context and structure a coherent discussion. Answering essay questions effectively requires 
analysis, synthesis, and critical thinking. Grading can be systematized by having subject matter specialists develop 
guidelines for responses and set quality standards. Scorers can then compare each student's essays against models 
that represent various levels of quality. 

Writing is the most common subject tested by performance assessment methods. Although multiple-choice 
tests can assess some of the components necessary for good writing (spelling, grammar, and word usage), having
students write is considered a more comprehensive method of assessing composition skills. Writing enables
students to demonstrate composition skills (inventing, revising, and clearly stating one's ideas to fit the purpose
and the audience) as well as their knowledge of language, syntax, and grammar. There has been considerable
research on the standardized and objective scoring of writing assessments. 

Oral discourse was the earliest form of performance assessment. Before paper and pencil, chalk, and slate
became affordable, school children rehearsed their lessons, recited their sums, and rendered their poems and prose 
aloud. At the university level, rhetoric was interdisciplinary: reading, writing, and speaking were the media of public
affairs. Today graduate students are tested at the Master's and Ph.D. levels with an oral defense of dissertations. But 
oral interviews can also be used in assessments of young children, where written testing is inappropriate. An obvious 
example of oral assessment is in foreign languages: fluency can only be assessed by hearing the student speak. As 
video and audio make it possible to record performance, the use of oral presentations is likely to expand. 

Exhibitions are designed as comprehensive demonstrations of skills or competence. They often require 
students to produce a demonstration or live performance in class or before other audiences. Teachers or trained 
judges score performance against standards of excellence known to all participants ahead of time. Exhibitions 
require a broad range of competencies, are often interdisciplinary in focus, and require student initiative and
creativity. They can take the form of competitions between individual students or groups, or may be collaborative 
projects that students work on over time. 

Experiments are used to test how well a student understands scientific concepts and can carry out scientific 
processes. As educators emphasize increased hands-on laboratory work in the science curriculum, they have 
advocated the development of assessments to test those skills more directly than conventional paper-and-pencil
tests. A few States are developing standardized scientific tasks or experiments that all students must conduct to
demonstrate understanding and skills. Developing hypotheses, planning and carrying out experiments, writing up 
findings, using the skills of measurement and estimation, and applying knowledge of scientific facts and underlying 
concepts (in a word, "doing science") are at the heart of these assessment activities.

Portfolios are usually files or folders that contain collections of a student's work. They furnish a broad portrait 
of individual performance, assembled over time. As students assemble their portfolios, they must evaluate their own 
work, a key feature of performance assessment. Portfolios are most common in writing and language arts, showing
drafts, revisions, and works in progress. A few States and districts use portfolios for science, mathematics, and the 
arts; others are planning to use them for demonstrations of workplace readiness. 

SOURCE: Office of Technology Assessment, 1992.

• students understand clearly the criteria on
which they are judged. 

Computer and Video Technologies 

Data processing technologies have played a 
significant role in shaping testing as we know it
today, and could be important tools for the develop- 
ment of innovative tests. Computers have most 
commonly been used for the creation of test items 
and the scoring and reporting of test results. New 
computer and video technologies, however, used 
alone or in conjunction with certain types of 
performance assessment, offer possibilities for en- 
hancing testing in the classroom. As computers have 
become more available in schools, their use for 
testing has become more feasible. Research in this 
field is showing promise in the following areas: 

• questions presented and answered on comput- 
ers can go beyond the traditional multiple- 
choice format, allowing test takers to create 
answers rather than select from alternatives 
presented to them; 

• video, audio, and multimedia can make more 
realistic and engaging questions and tasks 
available; 

• computer-adaptive testing can establish an 
individual test taker's level of skill more 
quickly and, under ideal conditions, more 
accurately than conventional paper-and-pencil 
testing; and 

• integrated learning systems, already found in 
some classrooms, often come with testing 
embedded in the instruction and provide on- 
going analysis of student progress. 

Continued research combining computing power, 
principles of artificial intelligence, learning theory, 
and test design could yield significant advances in 
the form and content of assessment. But a set of 
formidable technological and economic barriers 
needs to be surmounted: for example, the limited 
availability (and relatively higher cost) of hardware, 
compared to paper-and-pencil tests, has prevented 
more rapid innovation and adoption. And even with 
more hardware, there is no guarantee that the 
capacity of that hardware will be adequate to meet 
constantly increasing software requirements. An 
even greater barrier is the lack of communication 
between educators, test developers, and technolo- 
gists in achieving a consensus on the goals of testing 
and in shaping a vision for technology in the service 
of those goals. 

Using New Testing Technologies 
Inside Classrooms 

Performance assessment is not new to teachers or 
students; many techniques have long been used by 
teachers as a basis for making judgments about 
student achievement within the classroom. The form 
and complexity can vary: 

• Imagine yourself a rebel at the Boston Tea 
Party and write a letter describing what oc- 
curred and why. 

• Complete the following five geometry proofs. 

• Describe both the dramatic and situational 
irony in Dickens' Hard Times, specifically 
using the characters of the Teacher, Mr. Mc- 
Choakumchild, and the boss businessman in 
Coketown, Thomas Gradgrind. 

As illustrated in box E, what students produce in 
response to these testing tasks can reveal to the 
teacher more than just what facts they have learned; 
they reveal how well the student can put knowledge 
in context. Well-crafted classroom performance 
tasks are useful diagnostic tools that can reveal 
where a student may be having problems with the 
material. They can also help the teacher gauge the 
pacing and level of instruction to student responses. 
At their best, these tasks can be exciting learning 
experiences in themselves, as when a student, 
required to create a product or answer that puts 
knowledge into context, is blessed with that flash of 
inspiration, "Aha! I see how it all comes together 
now!" In addition, these tests can signal to the 
students what skills and content they should learn, 
help teachers adjust instruction, and give students 
clear feedback. 

Much of the research about learning and cognitive 
processes suggests important new possibilities for 
tests that can diagnose a student's strengths and 
weaknesses. Although traditional achievement tests 
have focused largely on subject matter, researchers 
are now recognizing that "... an understanding of 
the learner's cognitive processes— the ways in 
which knowledge is represented, reorganized, and 
used to process new information—is also needed." 28 



28 Robert L. Linn, "Barriers to New Test Designs," 1985 ETS Invitational Conference, 
B. Freeman (ed.) (Princeton, NJ: Educational Testing Service, 1986), p. 73. 
Box E — Mr. Griffith's Class and New Technologies of Testing: Before and After 

To understand how teaching and testing are traditionally used in the classroom, consider this fictional account 
of a fourth grade teacher's efforts to understand his students' progress, and the role standardized tests play in that 
understanding. We start with mathematics, or, as it is known in most fourth grade classrooms, arithmetic. 

Mr. Griffith is working on fractions. Among the 28 children in his class, 3 raise their hands to every one of the 
teacher's prompts, and usually have the answer right. Some of the other children seem to be on safe ground when 
it comes to adding and subtracting fractions, but appear puzzled over the rules of multiplying. The majority appear 
lost when it comes to division. Griffith has a sense of these differences based on his constant interaction with his 
class, but he needs more systematic information to know how to adjust his lessons. 

Before 

For starters, Griffith turns to his own tests, which are tightly linked to his instructional objectives and to the 
material he has covered in class. He also assesses the children in other ways: he checks their workbooks, calls on 
them to do problems at the blackboard, poses questions and invites answers, and eavesdrops while his students work 
in small groups. As an experienced teacher, Griffith can synthesize his observations of children at work into fluid 
judgments of their strengths and weaknesses and go that next vital step of adjusting his pedagogy accordingly. 

An additional source of information is the summary of statistics from last spring's administration of a 
nationally normed standardized mathematics test. From these data, Griffith could get a sense of how well the 
students in his class stack up against others in the school and even in the Nation as a whole, as measured by their 
performance on that test several months earlier. For example, he might find that Sarah and Jonathan, two of the three 
students who seem to know all the answers, scored high on the test. But he might also find that Richard, the third 
one, did less well than his current classroom performance would indicate. (Did he have a bad day in the spring, or 
did he work on his fractions over the summer?) He might also find that Noreen, another bright child in the class, 
did very well on the test but still gets stuck when she has to perform at the blackboard. 

On the whole, this test data provides information, but probably not enough for Griffith to get a complete picture 
of his students' learning needs or to structure his lesson plans. One problem is that a handful of his students were 
not even present for the spring testing, and he has no test data for them. Another problem is that the standardized 
test scores do not distinguish between fractions and other applications of addition and subtraction. When Griffith 
moves beyond fractions, there is no guarantee that the next topic on the curriculum will have been covered on the 
standardized test. 

It is not much better with reading and writing. The children read a lot of books on their own, but the reading 
tests supplied by the district still give passages out of context that have no meaning for many of the students. And, 
even though Griffith feels it is important to have his students do as much writing as possible, the tests are mainly 
questions on spelling and vocabulary. If he wants to make the children's scores look good, and the principal happy, 
he has to drill his students a lot on the mechanics. Important as they are, they do not inspire much enthusiasm in 
either the students or, truth be known, in Griffith. But scores are important for merit pay in his district, so Griffith 
knows where his priorities should be. 

After 

Consider again the situation of Mr. Griffith, our fourth grade teacher. In the last few years, his school has 
gradually invested in technology. Each class now has several computers linked together in an integrated learning 
system (ILS) that corresponds to the mathematics and language arts curriculum taught in his school. Money from 
the PTA made it possible for Griffith to purchase two additional stand-alone computers and a VCR, which connect 
to a television that had been locked in the storage room until a few years back. Occasionally he borrows the school's 
video camera from the library. While he is far from considering himself a "techie," Griffith took a few courses on 
teaching with computers and has grown pretty comfortable with their use, especially since he knows that his 
colleague, Mrs. Juster, a computer whiz, is just across the hall and willing to help him when he gets stuck. 

Mr. Griffith finds that, as he uses these technologies for teaching, common sense requires that he use them for 
testing as well. Like the teaching, the testing varies. Some of the testing he does is the same as before, but made 
simpler by the technology. With the help of a testmaker software package, he can design his own short-answer, 
essay, or multiple-choice quizzes geared to the material he has been teaching. He appreciates the fact that the 
software can automatically translate questions into Spanish, so Maria and Esteban, who recently arrived from El 
Salvador, can take tests with the rest of the class. The children say these tests are much easier to read than the 
handwritten ones he had to crank out on the school's ancient mimeograph machine. He keeps better track of their 
records with "gradebook" software that automatically computes and updates student averages and lets him know 
who is slipping in time for him to set up his little "fireside chats" with students. 

But the real change has been in being able to link his testing closer to the ... 
having his students do a lot of writing on the word processor. Now he has the students pass their writing around 
on the computer, make comments on each other's works, and save their first drafts. They seem more comfortable 
making revisions, and he can grade final products that are indeed ... 
written work in electronic portfolios on disk; at the end of each semester they choose the ... 
out for inclusion in the portfolio they take with them to the fifth grade. Some, like Regine, have a hard time deciding 
what is best and why. She'd like to print it all! 

The mathematics they have been working on is included in the software in the ILS: same old fractions and long 
division— the material that Griffith has watched, over the years, turn some students off mathematics forever while 
others just breeze through it. But at least now he can get a better handle on where the ... 

Dana is no problem— he has already moved on to two- and three-digit long division. At the end of his work, the 
system prints out a report that shows he got all 10 problems in the mini-test right, and completed it in 20 minutes. 
Griffith makes a note to himself: "Move Dana ahead to the next unit on the program and see how he does. It's 
far better than having him staring out the window while I'm going over the basics with the other kids." Michelle, 
who did fine with multiplication, continues to have difficulty in division problems. A quick printout of the problems 
she missed— with the step-by-step procedure she followed— reveals that her problem lies in subtraction— she keeps 
forgetting principles of carrying. "Maybe I can get Brad to work with her on some of those problems," he thinks. 
"Oops, Brad is too much of a tease. Better ask Kevin instead." 

Before it is time for the first grading period, Griffith prints a summary report on all the children's work. There 
is still a huge range in their skills, especially in mathematics. Even with the bells and whistles added in the computer 
programs, the curriculum can still be pretty deadly, Griffith knows. He decides to try using some of the new videos 
Mrs. Juster told him about as ways to get his students more interested in using mathematics to solve problems. "The 
one about the abandoned bell tower at the edge of town, in which the bell starts mysteriously ringing, might get their 
interest," he thinks. They like working in groups and digging out the clues in the video; looking for patterns and 
doing the mathematics to solve the problem might put some of these dry mathematics facts into context. Maybe. 

While they are watching the video, Griffith plans to get Elise, a student who just came into his class yesterday 
from a neighboring school district, started on the computer-adaptive test she will need for placement. It looks like 
she is quite far behind the other students; this will give a quick picture of her abilities and can be used in determining 
whether she might benefit from the Chapter 1 program in the school. "Shoot, I hate to have her miss that video, 
though. I suppose I can see if she can stay after school and take the test. She'll miss her bus home, though, and I'll 
be late picking up the baby at the day care center. And then there's the video report I promised to help Lindsey, Scott, 
and Sherri with. They are working on a report on 'Why we need new playground equipment' and interviewing 
students playing in the schoolyard after school. I can see they'll need a lot of help with that! Whoever said 
technology makes teaching easier?" 

SOURCE: Fictional scenario prepared by Office of Technology Assessment, 1992. 



New diagnostic tests, informed by cognitive science 
research, may help teachers recognize more quickly 
the individual learner's difficulties and intervene to 
get the learner back on track. Similarly, computer- 
administered tests open up new possibilities for 
keeping records of a student's errors or ineffective 
problem-solving strategies, and for providing imme- 
diate feedback so that children can recognize their 
errors while still involved in thinking about the 
questions. 29 



29 See, for example, Isaac Bejar, "Educational Diagnostic Assessment," Journal of Educational Measurement, vol. 21, No. 2, summer 1984, pp. 
175-189. 
Using New Testing Technologies 
Beyond Classrooms 

Teaching has always been an art more than a 
science, and what works in one classroom with one 
teacher does not easily transfer to other classrooms 
with other teachers. 30 Consequently, many of the 
methods used by teachers to gauge the progress of 
their students and adjust their lessons are not 
standardized. As long as teachers can correct their 
judgments on a continuing and fluid basis, day by 
day and hour by hour, teacher experimentation with 
a wide range of inferential assessment methods 
presents no particular harm and can offer many 
benefits. 

When judgments about student performance are 
moved outside the classroom, however, they must be 
comparable: ". . . whatever contextual understand- 
ing of their fallibility may have existed in the 
classroom is gone." 31 Using tests fairly and appro- 
priately for management decisions about schools or 
students, therefore, imposes special constraints. As 
explained in detail in chapter 6 of the full report, 
standardization in test administration and scoring is 
the first necessary condition to make test results 
comparable. It is precisely the recognition that 
individual teachers' judgments may be insufficient 
as the basis for crucial decisions affecting children's 
futures that historically has fueled public interest in 
standardized tests originating from outside the 
classroom or school. 32 

It is important to recall that the basic concept of 
direct assessments of student performance is not 
new. American schools traditionally used oral and 
written examinations to monitor performance. It was 
the pressure to standardize those efforts, coupled 
with the perceived need to test large numbers of 
children, that led eventually to the invention of the 
multiple-choice format as a proxy for genuine 
performance. Evidence that these proxies were more 
efficient in informing administrative decisions rap- 
idly boosted their popularity, despite their less 
obvious relevance to classroom learning. The mod- 
ern performance assessment movement is based on 
the proposition that new testing technologies can be 
more direct, open ended, and educationally relevant 
than conventional tests, and also reliable, valid, and 
efficient. 

How can performance assessments and computer- 
based tests contribute to system monitoring and 
selection, placement, and credentialing decisions? A 
growing number of States are experimenting with 
answers to this question. Thirty-six States currently 
use writing assessments and nine others are planning 
to introduce writing assessment in the near future. 
Twenty-one States currently use other performance 
assessment methods including portfolios, constructed 
response, and hands-on demonstrations; 19 States 
plan to adopt some or all of these methods. Figure 3 
shows the current geographic distribution of States 
using writing and other performance assessments. 
Some States are using sampling technologies to 
reduce the direct costs of performance assessments 
and are seeking to resolve various technical prob- 
lems. Most States are using these tests in combina- 
tion with the more familiar multiple-choice test. 

To the extent that decisions about school re- 
sources could be based on these statewide assess- 
ments, they are potentially high stakes. Advocates 
maintain that performance assessments have a clear 
advantage over standardized multiple-choice tests, 
because they assess a wider range of tasks. Al- 
though these assessments do not necessarily 
provide different estimates of individual student 
progress than some conventional tests, many 
educators believe their advantage lies in their 
more obvious relevance to learning goals. The 
involvement of teachers in developing and scoring 
performance assessments is crucial to keeping them 
closely linked to curricula and instruction. 

Using performance assessments beyond the con- 
fines of classrooms raises a set of important research 
and policy issues: 



30 See Richard Murnane and Richard Nelson, "Production and Innovation When Techniques are Tacit: The Case of Education," Journal of Economic 
Behavior and Organization, vol. 5, 1984, pp. 353-373; also Pauly, op. cit., footnote 12. 

31 Stephen Dunbar, Daniel Koretz, and H.D. Hoover, "Quality Control in the Development and Use of Performance Assessments," paper presented 
at the annual meeting of the National Council on Measurement in Education, Chicago, IL, April 1991, p. 1. 

32 If decisions about children's future opportunities are at stake, then the tests must also demonstrate sufficient "predictive validity," i.e., they must 
provide reasonably accurate information about individual potential for future behavior in school, work, or elsewhere. For discussion of issues pertaining 
to the use of test scores in predicting future performance, see, e.g., Henry Levin, "Ability Tests for Job Selection: Are the Economic Claims Justified?" 
Testing and the Allocation of Opportunity, B. Gifford (ed.) (Boston, MA: Kluwer, 1990); and James Crouse and Dale Trusheim, The Case Against the 
SAT (Chicago, IL: University of Chicago Press, 1988). 
Figure 3 — Statewide Performance Assessments, 1991 

Writing assessment only (n=15) 
Writing and other types of performance assessments (n=21) 
None (n=14) 



NOTE: Chart includes optional programs. 
SOURCE: Office of Technology Assessment, 1992. 

• The most common form of performance assess- 
ment is the evaluation of written work: essays, 
compositions, and creative writing have been 
widely used in large-scale testing programs. 
Other forms of performance assessment are still 
in earlier stages of development and, though 
promising, require considerable experimenta- 
tion before they can be used for high-stakes 
decisions. 

• If performance assessment is to be successfully 
adopted, continuing professional development 
for teachers will be critical. Most teachers 
receive little formal education in assessment. 
Performance assessment may provide a great 
opportunity for teacher development that links 
instruction with assessment. 

• Some parents and educators are worried that a 
move to greater use of performance assessment 
could have a negative impact on minority 
groups. It is critical that the issues of cultural 
influence and bias be scrutinized in all aspects 



of performance assessment: selection of tasks, 
administration, and scoring. 

• Administration and scoring of performance 
assessment are both time consuming and labor 
intensive. If the time spent on testing is viewed 
as integral to instruction, however, new meth- 
ods could be cost-effective. 

Computer technologies, too, may play a powerful 
role in system monitoring and high-stakes testing of 
individual students. In particular: 

• Adaptive testing, in which the computer selects 
questions based on individual students' re- 
sponses to prior questions, can provide more 
accurate data than conventional tests, and in 
less time. 33 

• Advances in software could make possible 
automated scoring that closely resembles human 
scoring. 

• Large item banks made possible by advanced 
storage technologies could lower the costs of 
test development by allowing State or district 



33 For discussion of the state-of-the-art in computer-adaptive testing, see Bert F. Green, The Johns Hopkins University, "Computer-Based Adaptive 
Testing in 1991," monograph, May 9, 1991. 

testing authorities to tap into common pools of 
questions or tasks. 
• With the combination of large item banks, 
computer-adaptive software, and computerized 
test administration, tests would no longer need 
to be composed in advance and printed on 
paper; rather, each student sitting at a terminal 
could theoretically face a completely individu- 
alized test. This could reduce the need for tight 
test security, given that most students cannot 
memorize the many thousands of items stored 
in item banks. 
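
The adaptive-testing idea in the list above can be sketched in a few lines. This is a minimal illustration, not any operational system's algorithm: it assumes a hypothetical item bank in which every item carries a single difficulty number, and it replaces the statistical estimation used in practice with a simple rule that moves the skill estimate up after a correct answer and down after an error, halving the step each time. The names `run_adaptive_test`, `item_bank`, and `answers_correctly` are invented for the example.

```python
from typing import Callable, Dict, List, Tuple


def run_adaptive_test(
    item_bank: Dict[str, float],
    answers_correctly: Callable[[str], bool],
    n_items: int = 5,
    start: float = 0.0,
    step: float = 2.0,
) -> Tuple[float, List[str]]:
    """Administer up to n_items, matching each item to the current estimate.

    item_bank maps a hypothetical item ID to a difficulty on an arbitrary scale.
    answers_correctly stands in for the student: given an item ID, it reports
    whether the response was right.
    """
    estimate = start
    remaining = dict(item_bank)
    administered: List[str] = []
    for _ in range(min(n_items, len(remaining))):
        # Pick the unused item whose difficulty is closest to the estimate.
        item = min(remaining, key=lambda i: abs(remaining[i] - estimate))
        del remaining[item]
        administered.append(item)
        # Move up after a correct answer, down after an error,
        # then halve the step so the estimate settles.
        estimate += step if answers_correctly(item) else -step
        step /= 2
    return estimate, administered
```

Because each student's path through the bank depends on his or her own answers, two students sitting side by side would generally see different items, which is the property the last bullet relies on.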

An important policy question regarding comput- 
ers in testing is whether to invest in new technolo- 
gies for scanning hand-written responses to open- 
ended test items. Since more tests may one day be 
administered by computer, investing in new scan- 
ning technologies could be wasteful. 

Special Considerations for System Monitoring 

Performance assessments and computer-based 
tests could be designed to provide information on the 
effectiveness of schools and school systems. As with 
all tests, though, the outcomes of these new tests 
need to be interpreted judiciously: the relative 
performance of schools or school systems must be 
viewed in the context of many factors that can 
influence achievement. 

Because individual student scores are not neces- 
sary for system monitoring, innovative sampling 
methods can be used that offer many important 
advantages for implementing performance assess- 
ments. When sampling is used, inferences can be 
made about a school system based on testing either 
a representative subsample of students or by giving 
each student only a sample of all the testing tasks. 
These methods can lessen considerably the direct 
costs of using long and labor-intensive performance 
tasks, allow broader coverage of the content areas 
that appear on the test, and still keep testing time 
limited. Furthermore, sampling methods provide 
important protection against misuse of a test for 
other functions (such as selection, placement, or 
certification), since students do not receive individ- 
ual scores. 
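
The second approach, giving each student only a sample of the tasks, can be sketched as follows. This is an illustration under stated assumptions rather than a description of any State's program: the round-robin assignment rule, the function names, and the record layout are all hypothetical, and the sketch assumes `tasks_per_student` is no larger than the task pool.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def assign_tasks(students: List[str], tasks: List[str],
                 tasks_per_student: int) -> Dict[str, List[str]]:
    """Give each student a rotating subset of the task pool."""
    assignment = {}
    for s_index, student in enumerate(students):
        start = (s_index * tasks_per_student) % len(tasks)
        assignment[student] = [tasks[(start + k) % len(tasks)]
                               for k in range(tasks_per_student)]
    return assignment


def system_report(scores: List[Tuple[str, str, float]]) -> Dict[str, float]:
    """Average each task over the students who took it.

    scores holds (student, task, score) records. Only task-level averages
    are returned; no per-student total is computed.
    """
    by_task = defaultdict(list)
    for _student, task, score in scores:
        by_task[task].append(score)
    return {task: sum(vals) / len(vals) for task, vals in by_task.items()}
```

The report covers every task in the pool even though no student attempts more than a few, and it deliberately produces no individual score that could be reused for selection or placement.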

However, the use of sampling methods raises 
specific concerns: one issue is whether students' less 
obvious incentives to do well on such tests (given 
that no individual consequences are attached to 
performance) could lead to erroneously low esti- 
mates of aggregate achievement. A related issue is 
whether tests administered to samples of students 
will effectively signal to all students what they are 
expected to learn. A third question is whether it 
would be fair to administer new testing methods, 
intended as tools for enriched instruction, to samples 
of students rather than to all students. 

These issues warrant further research as a prereq- 
uisite to using new testing methods for system 
monitoring functions. 

Photo credit: IBM Corp. 

Computers can change testing just as they change 
learning. Recent advances in computers, video, and 
related technologies could one day revolutionize testing. 

Special Considerations for Selection, 
Placement, and Credentialing 

New testing technologies have considerable po- 
tential to enrich selection and certification decisions. 
For example, portfolios of student work can provide 
richly detailed information about progress and 
achievement over time that seems particularly rele- 
vant and useful for certification decisions. One 
example is the Advanced Placement (AP) studio art 
examination, administered by the Educational Test- 
ing Service (ETS), which is based on a portfolio of 
student artwork. This examination is used to award 
college credit, and, as such, certifies that a student 
has mastered the skills expected of a first-year 
college student in studio art. 

Tests based on complex computer simulations of 
"on-the-job" settings are being developed for 
architecture, medicine, and other professions, as a 
basis for professional licensing and certification; the 
integration of graphics, video, and simulation tech- 
niques can create tests more closely resembling the 
actual tasks demanded by those professions. Al- 
though promising, these initial efforts have uncov- 
ered some technical issues that will require consider- 
ably more research before the tests can accurately 
and fairly assess the skills of interest, and be used to 
make high-stakes decisions about individuals. 34 

OTA has identified the following central policy 
issues concerning the design of new tests for 
selection, placement, and certification. 

Technical requirements— These tests must meet 
very high technical standards. Inferences drawn 
from them must be based on rigorous standards of 
empirical evidence not necessarily required of tests 
used for other functions. Because tests used to select, 
place, or certify individuals can have potentially 
long term and significant consequences, their uses 
need to be limited to the specific functions for which 
they are designed and validated. Similarly, because 
test scores are only estimates, very high levels of 
reliability, or consistency, must be demonstrated for 
the test as a whole. Finally, because of the amount of 
day-to-day variability in individuals, no one test 
score should be used alone to make important 
decisions about individuals. 35 



Generalizability — Another issue pertains to the 
content coverage of new assessment formats, such as 
exhibitions, portfolios, science experiments, or com- 
puter simulations. The advantage of these formats is 
in their coverage of relevant factors of performance 
and achievement; however, this usually means that 
only a few such long and complex tasks can be 
completed by a single child in the allotted time. 36 
Are inferences about achievement made on the basis 
of just a few tasks generalizable across the whole 
domain of achievement? When each child can 
complete only a few tasks, there is a much higher 
risk that a child's score will be specific to that 
particular task. Selection and certification decisions 
cannot be made on the basis of these tasks unless 
results are stable and generalizable. 
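
One standard way to see why a handful of tasks yields unstable scores is the Spearman-Brown formula from classical test theory. It is not cited in this report and is used here only as an illustration. If a single task has reliability r, a score built from k comparable tasks has projected reliability kr / (1 + (k - 1)r). The single-task reliability of 0.3 below is an assumed value chosen for the example, not an OTA finding.

```python
def projected_reliability(single_task_reliability: float, n_tasks: int) -> float:
    """Spearman-Brown projection for a score built from n_tasks comparable tasks."""
    r = single_task_reliability
    return n_tasks * r / (1 + (n_tasks - 1) * r)


# Assumed single-task reliability of 0.3, for illustration only.
for k in (1, 3, 10, 20):
    print(k, round(projected_reliability(0.3, k), 2))
```

Under that assumption the projection rises from 0.30 for one task to about 0.56 for three and about 0.81 for ten, which is why a test limited to a few long tasks may fall short of the stability needed for decisions about individuals.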

Security— Currently most high-stakes selection, 
placement, or certification tests are multiple-choice, 
and precautions are taken to keep items secret. 
Scores would be suspect if some (or all) test takers 
knew the items in advance. 37 Given the relatively 
low number of performance-based tasks that might 
appear on some new tests, sharing of information 
from one cohort of test takers to another could 
become a problem undermining the test's validity. 
Computers with enough memory to accommodate 
very large item banks may provide some technologi- 
cal relief, although the question remains open as to 
whether a sufficient number of different items could 
be written at reasonable cost. 

Fairness — Most previous legal challenges have 
targeted tests used to make significant decisions 
about individuals. Any test designed for selection, 
placement, or certification will be carefully scruti- 
nized by those concerned with equity and bias. 
Designing a performance-based selection or certifi- 
cation test will require considerable research to 
ensure elimination of bias. 



34 See, for example, David B. Swanson, John J. Norcini, and Louis J. Grosso, "Assessment of Clinical Competence: Written and Computer-Based 
Simulations," Assessment and Evaluation in Higher Education, vol. 12, No. 3, 1987, pp. 220-246. 

35 An additional reason for insisting on high standards is that high-stakes tests can lead inadvertently to the labeling of individuals— by themselves 
or by others— with uncertain and potentially harmful consequences. For discussion of these issues see, e.g., U.S. Congress, Office of Technology 
Assessment, "The Use of Integrity Tests for Pre-Employment Screening," background paper of the Science, Education, and Transportation Program, 
September 1990. 

36 Increasing the time allotted to assessment ... 
But completely "seamless" integration of testing and instruction could raise problems of its own, such as potential infringement of students' rights to 
know whether they are being tested and for what purposes. 

37 The concept of "test openness" is controversial. Most traditional measurement experts argue that allowing students access to test items in advance 
would irreparably compromise the test's validity. For opposing viewpoints, however, see, e.g., Judah Schwartz and Katherine A. Viator (eds.), The Price 
of Secrecy: The Social, Intellectual and Psychological Costs of Current Assessment Practice, A Report to the Ford Foundation (Cambridge, MA: 
Harvard Graduate School of Education, September 1990); and John Frederiksen and Allan Collins, "A Systems Approach to Educational Testing," 
Educational Researcher, vol. 18, No. 9, December 1989, pp. 27-32. 
Cost Considerations: A Framework 
for Analysis 

A common challenge posed to advocates of 
alternative assessment methods is an economic one: 
can they be administered and scored as efficiently as 
conventional standardized tests? 38 Indeed, one of the 
attractive features of commercially published stand- 
ardized tests is their apparently low cost. As shown 
in box F, OTA estimated outlays for standardized 
testing in a large urban school district were approxi- 
mately $1.6 million for 1990-91 ($0.8 million per 
test administration), or only about $6 per student per 
test administration. 

But these outlays on contracted materials and 
services and district testing personnel do not tell the 
whole story. First, they neglect the dollar value of 
teacher time devoted to test administration. Because 
a teacher's many activities are not typically itemized 
on a school district budget, the costs associated with 
teacher time spent administering tests are less 
obvious than other testing expenses. But they can be 
significant: in the district studied by OTA, the 
portion of total teacher salaries attributable to time 
spent administering tests was roughly $1.8 million 
per test, or $13 per pupil. 

Another important component of cost is the time 
spent by teachers in test preparation. This factor is 
more variable than administration time and is more 
difficult to estimate. It depends largely on the degree 
to which teachers can distinguish their regular 
instruction from classroom work that is driven by the 
need to prepare students for specific tests. The 
question is whether the test preparation activities 
would take place even in the absence of testing: this
issue hinges partly on test content— how closely 
does the test reflect curricular and instructional 
objectives? — and partly on how individual teachers 
allocate their classroom time across various activi- 
ties, including test-related instruction. (Tests that are 
intended to be linked to instruction might not be 
perceived as such by some teachers, and tests that are 
apparently separate from regular instruction could 
be useful tools in the hands of other teachers.) In the 



district OTA studied, teachers reported spending 
anywhere from 0 to 3 weeks in preparing their 
students for each test administration — at a cost as 
high as $13.5 million per test, or close to $100 per 
pupil. 39 
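
The per-pupil figures quoted above follow from dividing each per-test total by the number of students tested. The short sketch below reproduces that arithmetic. It assumes the figure of roughly 140,000 tested students given in box F; the function name per_pupil and the dictionary labels are illustrative and are not OTA's.

```python
# Back-of-the-envelope check of the per-pupil figures quoted in the text.
# Assumption: about 140,000 students are tested per administration (box F).
STUDENTS_TESTED = 140_000

# Per-test-administration totals, in dollars, as quoted in the text.
COSTS_PER_TEST = {
    "materials, services, and testing staff": 800_000,
    "teacher time: administration": 1_800_000,
    "teacher time: preparation (3-week upper bound)": 13_500_000,
}


def per_pupil(total_dollars: float, students: int = STUDENTS_TESTED) -> float:
    """Return the cost per tested student for one test administration."""
    return total_dollars / students


if __name__ == "__main__":
    for label, total in COSTS_PER_TEST.items():
        print(f"{label}: ${per_pupil(total):,.2f} per pupil")
```

Under these assumptions the three results round to about $6, $13, and a little under $100 per pupil, in line with the figures in the text.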

Just as counting material and testing personnel 
outlays alone can lead to deceptively low estimates 
of the total resources devoted to testing, accounting 
fully for teacher administration and preparation time 
can lead to deceptively high cost estimates. To 
correctly account for teacher time requires attention 
to the indirect or opportunity costs of that time. An 
opportunity cost is defined generally as ". . . the
value of foregone alternative action." 40 With respect
to testing, analysis of opportunity costs focuses 
attention on the following question: to what extent 
does the time spent by teachers on preparation and 
administration of tests contribute to the core class- 
room activities of teaching and learning? 

If testing is considered integral to instruction, then 
teacher time spent on preparing students and on 
administering the tests has lower opportunity costs 
than if the testing has little or no instructional value. 
To estimate the opportunity costs, then, requires 
information or assumptions about the degree to 
which any particular test is intended as an instruc- 
tional tool, and information or assumptions about 
the extent to which individual teachers use testing as 
part of their instructional program. 

As shown in box F, some teachers in the district 
OTA studied spent as much as 3 weeks preparing 
students for each of the two standardized tests, plus 
4 days administering each test. The worst case would 
be one in which this time was completely irrelevant 
to coursework: the district would have incurred 
steep opportunity costs— about $15 million per test, 
or close to $110 per pupil. The best case, in which all
preparation time was relevant to coursework, would 
have cost under $2 million per test, or $13 per pupil. 

Thus, the total costs of a testing program consist 
of both direct and opportunity components: direct 
expenditures on materials, services, and salaries, and 



38 The efficiency advantages of standardized multiple-choice tests are discussed in several places in this report. See especially ch. 4 for a historical
synopsis, ch. 6 for general discussion of item formats, and ch. 8 for review of technological change in test scoring and administration.

39 A full accounting of direct costs would also include overhead on the school building and grounds, i.e., depreciation attributable to time spent on
test preparation and administration. To simplify the analysis, OTA omitted this element.

40 David W. Pearce (ed.), The MIT Dictionary of Modern Economics, 3rd ed. (Cambridge, MA: MIT Press, 1986), p. 310.
Box F — Costs of Standardized Testing in a Large Urban School District

Because testing policy decisions are still primarily made at the local and State levels, OTA has analyzed the
kind of data on standardized testing costs that school authorities would likely include in their deliberations over
testing reform. Data for this illustrative example were provided by the director of Testing and Evaluation in a large
[...] teachers, including regular classroom and special teachers. The total 1990-91 district budget
was $1.2 billion.

Approximately 140,000 students in grades kindergarten through 12 take tests, once a year in kindergarten and 
twice a year (fall and spring) in all other grades (absenteeism and student mobility account for the large number of
untested students). During each test administration, students take separate tests in English, mathematics, social 
studies, and science. The tests typically consist of norm-referenced questions supplemented with locally developed 
criterion-referenced items. (In kindergarten, first, second, and third grades, criterion-referenced checklists filled out
by teachers supplement the paper-and-pencil tests.) The tests are machine scored by the test publisher, who provides 
computer-generated score reports to district personnel. 

Tests are administered by 4,500 regular classroom
teachers; there are no other special personnel involved,
except for a small group of district staff who design the
criterion-referenced items, manage the overall testing
program, and conduct research based on test results.

Although the district purchases tests from a large
commercial publishing company that has many school
districts as customers, the cost figures discussed below
are not necessarily representative of other school
districts in the United States.

Materials and Services

In most years, the district purchases only a limited
supply of test booklets, replacing the complete set only
once every few years when they become damaged or
when test items are revised. OTA computed average
annual expenditures on test booklets based on test
publishers' estimates that booklets are recycled typically
once every 7 years. As shown in table F1, total
annual outlays for the standardized testing program in
1990-91 (including materials, contracted scoring and
reporting services, and nonteaching personnel) were
approximately $1.6 million, or $5.70 per student per
test administration. 1

Teacher Time

Based on the specified time allotments for the
various tests in the various grades, and on conversations
with district staff, OTA found that full-time
teachers in the district spend roughly 2 percent of their
annual work time in the administration of tests to
students. The total salary cost to the district for teacher
time spent administering tests was roughly $3.6
million for two testing administrations ($1.8 million
per testing cycle).

Table F1—Outlays on Materials, Services, [...]

Materials
  Contracted:
    Test booklets: new purchases plus annualized
      costs based on assumed 7-year cycle          $369,000
    Practice books                                   40,400
    Examiner manuals                                 28,200
    Checklists and worksheets                       100,600
    Kindergarten program                             33,300
  Other:
    Kindergarten Chapter 1 tests                     $3,000
    Labels                                            1,200
    [label illegible]                                17,900
    Answer sheets                                    23,000
    Headers                                           2,700
    Language battery                                   1300
    Special tests                                    14,100
  Materials subtotal                               $641,700

Services
  Contracted:
    Scoring                                        $175,600
    Report generation                               141,800
    Collection                                       14,800
    Scanning                                        146,500
    Distribution                                      9,000
  Services subtotal                                $487,700

Nonteaching personnel:
    Assistant director                              $58,200
    Research manager                                 56,500
    Research associates (2)                         108,700
    Research assistants (3)                         127,800
    Secretaries                                      55,500
    [label illegible]                                45,600
  Nonteaching personnel subtotal                   $453,300

Total                                            $1,582,700

SOURCE: Office of Technology Assessment, based on data supplied by a
large urban school district, 1990-91 academic year.



1 To compare the cost of this district's standardized testing program with others, OTA looked at cost data from the November 1988
"Survey of Testing Practices and Issues," conducted by the National Association of Test Directors (NATD). The survey was sent to testing
directors in approximately 123 school districts. For 38 districts providing their cost information, the average direct cost per student was $4.80
per year, slightly lower than the $5.70 per student in this example. Most of the districts responding to the NATD survey administer achievement
tests only once a year, compared to OTA's example district, which tests twice a year in grades 1 to 12.
In conversations with district teachers, OTA found that the time they spend in classroom preparation
of students for the standardized tests varies from 0 to 3 weeks per testing administration. Some teachers
claim they spend no time doing test preparation that is distinguishable from their regular classroom
instruction; others use the standardized test as a final [...] in-class review time. OTA therefore
estimated the salary costs for preparation time under three scenarios: 0, 1.5, and 3 weeks (per test).
These estimates are summarized in table F2.

Table F2—Salary Costs of Teacher Time Spent on Testing, per Test Administration

  Test administration    Test preparation           Total
  $1.8 million           0 weeks: 0                 $1.8 million
                         1.5 weeks: $6.8 million    $8.6 million
                         3 weeks: $13.5 million     [...]

Total Direct Costs

The total direct costs of testing can be computed by adding the [...]
costs of teacher time for test preparation and administration [...]
not account for the degree to which teacher time spent on testing is [...]
part of regular instruction. The importance of indirect or opportunity costs as it pertains to the analysis of testing
costs is illustrated in box G.



indirect costs of time spent on testing activities. 41 
For a graphical exposition of this concept, see box G. 
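
The idea that total cost combines a direct outlay with an opportunity cost that grows with time spent on testing can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each testing option has a fixed direct cost per pupil and an opportunity cost that rises linearly with hours spent; it then finds the amount of time at which two options cost the same. The linear form, the function names, and every number in the usage example are hypothetical and are not drawn from OTA data.

```python
# Hypothetical linear model: total cost = direct outlay + rate * hours.
# The linear form and all numbers used with it are illustrative assumptions.


def total_cost(direct: float, opportunity_rate: float, hours: float) -> float:
    """Total per-pupil cost of a testing option after `hours` of testing time."""
    return direct + opportunity_rate * hours


def break_even_hours(direct_1: float, rate_1: float,
                     direct_2: float, rate_2: float) -> float:
    """Hours of testing time at which the two options have equal total cost.

    A negative result means the cost lines never cross for positive time.
    """
    if rate_1 == rate_2:
        raise ValueError("parallel cost lines never cross")
    return (direct_2 - direct_1) / (rate_1 - rate_2)


if __name__ == "__main__":
    # Option 1: cheap to buy, little instructional value (high rate).
    # Option 2: costly to buy, useful for instruction (low rate).
    hours = break_even_hours(direct_1=6.0, rate_1=5.0, direct_2=30.0, rate_2=1.0)
    print(f"break-even at {hours} hours, "
          f"total cost ${total_cost(6.0, 5.0, hours):.2f} per pupil")
```

Below the break-even time the option with the lower direct cost is cheaper overall; beyond it, the option with the lower opportunity cost per hour is cheaper, if cost is the main consideration.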



Federal Policy Concerns 

Several proposals now pending before Congress 
could fundamentally alter testing in the United 
States. Three issues already on Congress' agenda are
proposals for national testing, changes to the Na- 
tional Assessment of Educational Progress (NAEP), 
and revisions to the program that assists education- 
ally disadvantaged children (Chapter 1). Federal 
action could also focus on ensuring the appropriate 
use of tests, and speeding research and development 
on testing. 

These policy opportunities combined with the 
current national desire to improve schooling provide 
Congress with an opportunity to form comprehen- 
sive, coordinated, and far-reaching test policy. 
Rather than allowing test activity to occur haphaz- 
ardly in response to other objectives, decision- 
makers can bring these several concerns together in 
support of better learning. 



National Testing 

As discussed in chapter 3 of the full report, the 
past year has witnessed a flurry of proposals to 
establish a system of national tests in elementary and
secondary schools. Momentum for these efforts has 
built rapidly, fueled by numerous governmental and 
commission reports on the state of the economy and
the educational system; by the National Goals
initiative of the President and Governors; by casual 
references to the superiority of examination systems 
in other countries (see box H); and most recently by 
the President's "America 2000" plan. 

The use of tests as a tool of education policy is 
fraught with uncertainties. The first responsibility of 
Congress is to clarify exactly what objectives are 
attached to the various proposals for national 
testing, and how instruments will be designed, 
piloted, and implemented to meet these objectives. 
The following questions warrant careful attention: 

• If tests are to be somehow associated with 
national standards of achievement, who will 
participate in setting these standards? Will the 
content and grading standards be visible or 
invisible? Will the examination questions be 



41 In addition to teacher time, there are opportunity costs associated with student time: assuming that instructional time is an investment with economic
returns, student time spent on testing can be valued in terms of foregone future income. This follows a "human capital" investment model of education.
See, e.g., Gary Becker, Human Capital, 2nd ed. (New York, NY: National Bureau of Economic Research, 1975). For application of the concept of indirect
costs to educational testing see also Walter Haney, George Madaus, and Robert Lyons, Boston College, "The Fractured Marketplace for Standardized
Testing," unpublished manuscript, September 1989.
Box G — Direct and Opportunity Costs of Testing

[Figure: total costs, including opportunity costs, plotted against time spent on testing (preparation and administration), with one curve each for testing option 1 and testing option 2.]

This figure illustrates the relationship between
time spent on testing activity and the total costs of
testing. Hypothetical test 1 is assumed to contribute
little to classroom learning. It costs little in direct
dollar outlays, but it bears opportunity costs: total
costs begin relatively low but rise with the time
devoted by teachers and students to activities that
take them away from instruction.

Hypothetical test 2, which is a useful instruction
and learning tool, requires relatively high direct
expenditures. But the opportunity costs of time
devoted to testing are relatively low.

At point A, a school district would be indifferent
between the two testing programs, if cost was the
main consideration.

SOURCE: Office of Technology Assessment, 1992.
kept secret or will they be disclosed after the
test? 

• If the objective of the test is motivational, i.e., 
to induce students and teachers to work harder, 
then the test is likely to be high stakes. What
will happen to students who score low? What 
resources will be provided for students who do 
not test well? What inferences will be made 
about students, teachers, and schools on the 
basis of test results? What additional factors 
will be considered in explaining test score 
differences? Finally, will the tests focus the 
attention of students and teachers on broad 
domains of knowledge, as desired, or on 
narrower subsets of knowledge covered by the 
tests, as often happens? 

• If the Nation is interested in using tests to 
improve the qualifications of the American 
work force, how will valuable nonacademic 



skills be assessed? What should be the balance 
of emphasis between basic skill mastery and 
higher order thinking skills? 

• If there is impatience to produce a test quickly, 
it is likely to result in a paper-and-pencil 
machine-scorable test. What signal will this 
give to schools concerning the need to teach all 
students broader communication and problem- 
solving skills? 

• What effects will national tests have on current 
State and local efforts to develop alternative 
assessment methods and to align their tests 
more closely with local educational goals? 

• Would the national examinations be adminis- 
tered at a single setting or whenever students 
feel they are ready? 

• Would students have a chance to retake an 
examination to do better? 

• Would the tests be administered to samples of 
students or all students? 

• At what ages would students be tested? 

• What legal challenges might be raised? 

If a test or examination system is placed into 
service at the national level before these impor- 
tant questions are answered, it could easily 
become a barrier to many of the educational 
reforms that have been set into motion, and could 
become the next object of concern and frustration 
within the American school system. 

Given that a national testing program could be 
undertaken through State and/or private sector 
initiatives, the role of Congress is not yet entirely 
clear. However, to the extent that congressional 
action regarding NAEP, Chapter 1, and appropriate 
test use will affect the need for and impact of any 
national examinations, Congress has a strong inter- 
est in clarifying the purposes and anticipated conse- 
quences of such examinations. Also, Congress must 
carefully analyze the pressures the national test 
movement is exerting on these programs, such as the 
idea of converting NAEP into a national test for all 
students. 

Future of the National Assessment of 
Educational Progress 

NAEP has proven to be a valuable tool to track 
and understand educational progress in the United 
States. It was created in 1969 and is the only 
regularly conducted national survey of educational 
achievement at the elementary, middle, and high 

Box H — National Testing: Lessons From Overseas 1

The American educational system has a traditional commitment to pluralism in the definition and control of
[...] as well as the fair provision [...]
examination systems, which have historically been geared principally toward selection, placement, and
credentialing, need to be considered judiciously. OTA finds that the following factors should be considered when
comparing examination systems overseas with those in the United States:

• Examination systems in almost every industrialized country are in flux. Changes over the past three decades
have been quite radical in several countries. Nevertheless, there is still a relatively greater emphasis on tests 
used for selection, placement, and certification than in the United States. 

• None of the countries studied by OTA has a single, centrally prescribed examination that is used for all
purposes— classroom diagnosis, selection, and school accountability. Most examinations overseas are used 
today for certifying and sorting individual students, not for school or system accountability. Accountability
in European countries is typically handled by a system of inspectors charged with overseeing school and 
examination quality. Some countries occasionally test samples of students to gauge nationwide 
achievement. 

• External examinations before age 16 have all but disappeared from the countries in the European 
community. Primary certificates used to select students for secondary schools have been dropped as
comprehensive education past the primary level has become available to all students. 

• The United States is unique in the extensive use of standardized tests for young children. Current proposals 
for testing all American elementary school children with a commonly administered and graded examination 
would make the United States the only industrialized country to adopt this practice. 

• There is great variation in the degree of central control over curriculum and testing in foreign countries. In 
some countries centrally prescribed curricula are used as a basis for required examinations (e.g., France,
Italy, the Netherlands, Portugal, Sweden, Israel, Japan, China and, most recently, the United Kingdom). 
Other countries are more like the United States in the autonomy of States, provinces, or districts in setting 
curriculum and testing requirements (Australia, Canada, Germany, India, and Switzerland). 

• Whether centrally developed or not, the examinations taken during and at the end of secondary school in 
other countries are not the same for all students. Syllabi in European countries determine subject-matter 

1 This box draws on information from George Madaus, Boston College, and Thomas Kellaghan, St. Patrick's College, Dublin, "Student
Examination Systems in the European Community: Lessons for the United States," OTA contractor report, June 1991.

Continued on next page 



school levels. It was designed to be an educational 
indicator, a barometer of the Nation's elementary 
and secondary educational condition. NAEP reports 
group data only, not individual scores. 

NAEP has also been an exemplary model of 
careful and innovative test design. As discussed in 
chapter 3 of the full report, NAEP has made 
pioneering contributions to test development and 
practice: "matrix" sampling methods, broad-based 
processes for building consensus about educational 
goals, an emphasis on content-referenced testing, 
and the use of various types of open-ended items in 
large-scale testing. 

If Congress wishes to develop a new national
test, to be administered to each child and used as
a basis for important decisions about children
and schools, OTA concludes that NAEP is not
appropriate. This objective would require fundamental
redesign and validation of NAEP, and would
alter the character and value of NAEP as the 
Nation's independent gauge of educational progress. 
It would also greatly increase both the cost and time 
devoted to NAEP at every level. 

A better course for Congress is to retain and 
strengthen NAEP's role as a national indicator of 
educational progress. To do this, Congress could: 

• require NAEP to include more innovative items 
and tasks that go beyond multiple choice; 

• fund the development of a clearinghouse for the 
sharing of NAEP data, results of field trials, 
statistical results, and testing techniques, giv- 
ing States and local districts involved in the 
design of new tests better access to the lessons 
from NAEP; 

• restore funding for NAEP testing in more 
subject areas, such as the fine arts; 
Box H — National Testing: Lessons From Overseas — Continued

content and examinations are based on them, organized in terms of traditional subject areas (language,
mathematics, sciences, history, [...]
(general or specialized). Even in European Community (EC) countries with a national system, the
examinations are [...] all students [...]
examinations may also be [...]
[...] high-level, low-level, and various curricular options).

• With differentiated examinations, multiple options give students on lower tracks the chance to choose lower 
level examinations. It appears, though, that these school-leaving examinations can discourage students who 
do not expect to do well from staying in school.

• In no other system do commercial test publishers play as central a role as they do in the United States. In 
EC and other industrialized nations, tests are typically established, tested, and scored by ministries of
education, with some local [...]
traditionally been dominated by and oriented toward the universities. In Europe, most examination systems
are organized around a system of school inspectors, with quasi-governmental control through the 
establishment of local boards, or multiple boards in larger countries. 

• Psychometrics does not play as [...]
countries. Although issues of fairness and [...]
the United States.

• Teachers in other countries have considerable responsibility for administering and scoring examinations. In 
some countries (Germany, the U.S.S.R., and Sweden) they even grade their own students. Teacher contracts
often include the expectation that they will develop or score examinations; they are sometimes offered extra
summer pay to read examinations. 

• Syllabi, topics, and even sample questions are widely publicized in advance of examinations, and it is not 
considered wrong to prepare explicitly for examinations. Annual publication of past examinations strongly

influences instruction and learning. 

• In European countries, the dominant form of examination is "essay on demand." These examinations
require students to write essays of varying lengths in responses to short-answer or open-ended questions. 
Use of multiple-choice examinations is limited, except in Japan, where they are as prevalent as in the United 
States. Oral examinations are still common in some of the German Länder and in foreign language testing
in many countries. Performance assessments of other kinds (demonstrations and portfolios) are used for
internal classroom assessment.



• support the continued development of methods 
to communicate NAEP results to school offi- 
cials and the general public in accurate and 
innovative ways (particular emphasis could be 
placed on informing the public about appropri- 
ate ways to interpret and understand such test 
data and on minimizing misinterpretation by 
the press and general public); 

• add testing of nonacademic skills and knowl- 
edge relevant to the world of work; 

• restore funding for the assessment of out-of- 
school youth at ages 13 and 17, to provide a 
better picture of the knowledge and skills of an 
entire age cohort; 



• request data on the issues surrounding test- 
takers' motivation to do well on NAEP in 
various grades; 42 

• expand NAEP to assess knowledge in the adult
nonschool population; and 

• ensure that matrix sampling is retained, to 
minimize both costs and time requirements of 
NAEP. 
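
Matrix sampling, mentioned in the last option above, saves time because no student answers the whole item pool, yet every item is still answered by some students, so group-level results can be estimated. The sketch below illustrates only that basic idea with a simple rotation of item blocks. It is not NAEP's actual design, which is more elaborate; the function name assign_blocks and the example sizes are hypothetical.

```python
from collections import Counter

# Simplified illustration of matrix sampling: split the item pool into
# blocks and hand each student one block in rotation. This is an
# assumption-laden sketch, not NAEP's actual sampling design.


def assign_blocks(item_ids: list, student_ids: list, n_blocks: int) -> dict:
    """Map each student to one block of items, rotating through the blocks."""
    if n_blocks < 1 or n_blocks > len(item_ids):
        raise ValueError("n_blocks must be between 1 and the number of items")
    # Deal items into blocks round-robin so block sizes differ by at most one.
    blocks = [item_ids[i::n_blocks] for i in range(n_blocks)]
    return {student: blocks[k % n_blocks] for k, student in enumerate(student_ids)}


if __name__ == "__main__":
    items = list(range(12))      # a hypothetical 12-item pool
    students = list("ABCDEF")    # six hypothetical students
    assignment = assign_blocks(items, students, n_blocks=3)

    # Each student sees only a third of the pool...
    print({s: len(block) for s, block in assignment.items()})
    # ...but every item is still answered by some students.
    coverage = Counter(item for block in assignment.values() for item in block)
    print(dict(coverage))
```

In this example each student answers 4 of the 12 items while every item is answered by 2 students, which is why the approach yields group data only and not individual scores.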

An experiment in extending the uses of NAEP to 
provide data on educational progress at the State 
level and to measure this progress against national 
standards is now under way. 

OTA has identified three potential problems of 
using NAEP for State-by-State comparisons that 



42 In particular, questions have been raised about the accuracy of information derived from tests of 12th graders who are about to graduate. Further
[...] efforts and research could shed light on this issue. Ed Roeber, Michigan Educational Assessment Program, personal communication, October 1991.
The National Assessment of Educational Progress (NAEP)
has pioneered the use of performance assessments in
large-scale testing programs. In this science task, 7th and
11th grade students figure out which of the three materials
would make the box weigh the most.

Congress should review before making a final 
decision on a permanent use of NAEP for this 
purpose. First, States could be pressured to introduce 
curriculum changes to improve their NAEP per- 
formance on certain subjects, regardless of whether 
such changes have educational merit. For example, 
following the release in 1991 of the State-by-State 
results from the first such trial, some States (e.g., the 
District of Columbia) announced plans to revamp 
their mathematics curricula. It could be argued that 
the use of NAEP as a prod to State education 
authorities to rethink their curricula is a good thing; 
however, it is clear that the pressure to perform on 
the test can outweigh the stimulus for careful 



deliberation about academic policy, and that many 
States could make changes for the sake of higher 
scores rather than improved learning opportunities 
for children. This signifies putting the cart of testing 
before the horse of curriculum, exactly the kind of 
outcome feared by the original designers of NAEP 
who insisted that scores not be reported below broad 
regional levels of aggregation. 

Second, the presentation of comparative scores 
could lead to intensified school-bashing — even when 
differences in average State performance are statisti- 
cally insignificant or when those differences reflect 
variables far beyond the control of school authorities.
Critics of comparative NAEP reporting point
out that low-scoring States need real help— finan- 
cial, organizational, and educational— not just more 
testing and public humiliation. 

Finally, extending NAEP to State-level analysis 
and reporting is a costly undertaking. NAEP funding 
jumped from $9 million in 1989 to $19 million in 
1991. It is not clear that this extra money provides a
proportional amount of useful information: one 
researcher interested in this question showed that 
roughly 90 percent of the variance in average State 
performance on NAEP could be explained by 
socioeconomic and demographic variables already 
available from other data. 43 In a time of scarce 
educational resources, NAEP extensions need to be 
weighed carefully on the scale of anticipated bene- 
fits per dollar. State-by-State comparisons of NAEP 
performance may not pass this cost-benefit test. 44 

These issues notwithstanding, many education 
policymakers at the State and national levels have 
insisted that State-level NAEP could provide new 
and useful information to support curricular and 
instructional reform. Their arguments should be 
taken as potentially fruitful research hypotheses and 
treated as such: just as new medical treatments 
undergo careful experimentation and evaluation 
before gaining approval for general public use, 
extensions and revisions to NAEP should be post- 
poned pending analysis of research data. 

In education, the line between research and 
implementation is often blurred; few newspapers 
noted that the 1990 State mathematics results were 
the first in a "trial" program— the results were 



43 See Richard Wolf, Teachers College, Columbia University, "What Can We Learn From State NAEP?" unpublished document, n.d.

44 See also Daniel Koretz, "State Comparison Using NAEP: Large Costs, Disappointing Benefits," Educational Researcher, vol. 20, No. 3, April
1991, pp. 19-21.
treated as factual evidence of relative effectiveness
of State education systems. 

The NAEP standard-setting process also raises 
questions of feasibility and desirability. As dis- 
cussed in chapter 6 of the full report, the translation 
of broad educational goals—such as emphasizing
problem-solving skills in the mathematics curricu- 
lum—into specific test scores is a complex and 
time-consuming task. The particular performance 
standards selected must be validated empirically: 
how closely educators in different parts of the 
country will concur on standards of proficiency for 
children at different stages of schooling is not 
known. Standard setting has always been a slippery 
process— in employment, psychological, or educa- 
tional testing— in large part because of difficulties 
surrounding the designation of acceptable "cutoff 
scores." Not surprisingly, controversy surrounded 
the initial attempts to reach consensus on standards 
for NAEP, with experts disagreeing among themselves
on key definitions and interpretations of
items. 

Educators and policymakers continue to debate 
whether nationwide standards are desirable, espe- 
cially if children who do not reach the defined 
standards are somehow penalized. In addition to the 
potential effects on children, turning NAEP into a 
higher stakes test— with implicit and explicit re- 
wards pegged to achievement of the given profi- 
ciency standards — could irreparably undermine 
NAEP's capacity as a neutral barometer of educa- 
tional progress. 

While continued research on State-by-State 
NAEP and on standard setting will be useful, 
Congress needs to find ways to ensure that data 
from this research are reported as such and that 
the results are not prematurely construed as 
conclusive. 

Chapter 1 Accountability 

Because of its scope and influence, Chapter 1 
represents a powerful lever by which the Federal 
Government affects testing practices in the United 
States. OTA's analysis of Chapter 1 testing and 
evaluation requirements (see ch. 3 in the full report) 
suggests several congressional policy options that 
could improve Chapter 1 accountability while reducing
the overall testing burden in the United
States.



Chapter 1, the largest Federal program of aid to 
elementary and secondary education, provides sup- 
plementary education services for disadvantaged 
children. Over its 25-year history, Chapter 1 evalua- 
tion and assessment requirements have been revised 
many times. The result is an elaborate web of legal 
and regulatory requirements with standardized norm- 
referenced achievement tests as the basic thread. The 
tests fulfill several functions: Federal policymakers 
and program administrators use nationally aggre- 
gated scores to judge the program's overall effec- 
tiveness; and local school districts and States use 
scores to determine which schools are not making 
sufficient progress in their Chapter 1 programs, to 
place children in the program, to assess children's 
educational needs, and for other purposes. 

As a result of the 1988 amendments to Chapter 1, 
which introduced the "program improvement" 
concept, Chapter 1 testing became even more 
critical. At the national level, there has been growing
concern that the aggregated test data, collected by
school districts with widely divergent expertise in
evaluation, do not provide an accurate and well-rounded
portrait of the program's overall effectiveness.
At the school district level, educators argue
that the test data often target the wrong schools for
program improvement or miss the schools with the
weakest programs in the district or the subject areas
and grade levels most in need of help. At the
classroom level, teachers tend to feel that their own
tests and assessments, as well as some externally
designed criterion-referenced tests, afford a much
better picture of individual students' progress than
do the norm-referenced tests.

Congress' principal challenge vis-à-vis Chapter
1 is to find ways to separate Federal evaluation 
needs from State and local needs. It is a tough 
dilemma: to balance the national desire for meaning- 
ful and comparable program accountability data 
against State and local needs for useful information 
on which to base instructional and programmatic 
decisions. Congress will consider reauthorization of 
Chapter 1 in 1993. Hearings and analysis on these 
complex questions in 1992 would provide an excel- 
lent basis for a major revision of the evaluation and 
testing requirements. 

One way to improve Chapter 1 accountability 
is to create a system that separates national 
evaluation needs from State and local information
needs. It is the perceived need for nationally
aggregated data that drives the use of norm-referenced
tests. If Congress separated national
evaluation purposes from State and local purposes 
and articulated different requirements for each, State 
Education Agencies (SEAs) and local education 
authorities would be free to use a variety of 
assessment methods that better reflect their own 
localized Chapter 1 goals. The national data would 
be used to give Federal policymakers, taxpayers, and 
other interested groups a national picture of Chapter 
1 effectiveness, while the State and local informa- 
tion would be used in modifying programs, placing 
students, targeting schools for program improve- 
ment, deciding on continuation of schoolwide proj- 
ects, and other purposes. 

Congress could obtain national data on Chapter 1 
through a well-constructed, periodic testing of 
Chapter 1 children, similar to the way NAEP is used 
to assess the progress of all students. This assess- 
ment would rely on sampling (rather than testing of 
every student) and could be administered less 
frequently than the current tests. In addition to 
relieving the testing burden on individual students 
and reducing the time devoted to testing by teachers, 
principals, and other school personnel, this proce- 
dure could also result in higher quality data. As the 
principal client of the data, the Federal Government 
could identify the areas to be assessed, instill greater 
standardization and rigor in test administration and 
data analysis, and avoid the aggregation problems 
that arise from thousands of school districts admini- 
stering different instruments under divergent condi- 
tions. This type of Federal assessment could be 
designed and administered by either an independent 
body or the Department of Education, with the help 
of the Chapter 1 Technical Assistance Centers. 

The system might be designed to provide a menu 
of assessment options— criterion-referenced tests, 
reading inventories, directed writing, portfolios, and 
other performance assessments — from which States 
could establish statewide evaluation criteria for 
Chapter 1 programs. If Congress preferred maxi- 
mum local flexibility, the discretion to choose 
among the assessment options could be left to school 
districts, as long as they administered the instru- 
ments uniformly and consistently across schools. 
The Chapter 1 Technical Assistance Centers could 
help the States and school districts select and 
implement appropriate measures.

Either a State or local option would increase the
latitude for linking assessments to specific program 
goals. However, if States or districts were to select 
instruments that put their Chapter 1 programs in the 
best light, the information could be misleading. 
Congress should take steps to see that this does not
happen. For example, a strict approach would
require programs to show growth in student achievement
using multiple indicators, perhaps including
one indicator based on a standardized test. A looser 
version of this option would allow States or districts 
to develop their own evaluation methods, and set 
their own standards of acceptable progress, subject 
to Department of Education approval. 

An advantage of separating evaluation require- 
ments would likely be local development of new 
testing methods, which have not been widely used in 
Chapter 1 because of the need for national aggrega- 
tion and comparability. Congress could encourage 
this choice by reserving some of the Federal Chapter 
1 evaluation and research funding to advance the 
state of the art.

For example, competitive grants could be author- 
ized for local education agencies, SEAs, institutions 
of higher education, Technical Assistance Centers, 
and other public and private nonprofit agencies to 
work on issues such as calibrating alternative 
assessments, training people to use them, bringing 
down the cost, and making them more objective. 
Congress could also consider allowing funds from 
the 5-percent local innovation set-aside to be used 
for local development and experimentation. 

Since Chapter 1 is a major national influence on 
the amount, frequency, and types of standardized 
testing, a broad research and development effort for 
Chapter 1 alternative assessment would have an 
impact far beyond Chapter 1. The instruments, 
procedures, and standards developed by this type of 
effort would spill over into other areas of education, 
such as early childhood assessment, and would 
increase local districts' experimentation in other 
components of their educational programs. 

An important issue for congressional consideration
is the appropriate grade levels for Chapter 1
evaluations. There is considerable agreement that 
testing of children in the early grades is inappropri- 
ate, especially if standardized norm-referenced paper- 
and-pencil tests are used; the 1988 reauthorization 
eliminated testing requirements for children in 
kindergarten and first grade. On the other hand, there
are compelling arguments that from a program
evaluation point of view it is important to have 
"pre" and "post" data, which means collecting 
some baseline information. Lack of a reliable 
method to demonstrate progress during the early 
years could discourage principals from channeling 
Chapter 1 funds to very young children, despite 
evidence that early intervention is very effective. If 
testing is required to show progress, these tests 
should be developmentally appropriate. 45

A related congressional issue concerns the assess- 
ment of school children who have only been in a 
given school's Chapter 1 program for a short period 
of time; school districts throughout the country cite 
the high mobility of Chapter 1 children as a logistical 
obstacle to meaningful evaluation. Despite regula- 
tory guidance, confusion continues to reign in State 
and local Chapter 1 offices about how to deal with 
a mobile student population. Clear and consistent 
policies regarding testing of these children would 
alleviate some of that confusion. 

Appropriate Test Use 

The ways tests should be used and the types of 
inferences that can appropriately be drawn from
them are often not well understood by policymakers, 
school administrators, teachers, or other consumers 
of test information. Perhaps most important, many 
parents and test takers themselves are often at a loss 
to understand the reasons for testing, the importance 
of the consequences, or the meaning of the results. 
School policies about how test scores will be used 
are important not only to students and parents but 
also to teachers and other school personnel whose 
own careers may be influenced by the test perform- 
ance of their pupils. Many of these problems result 
from using tests for purposes for which they are not 
designed or adequately validated. Fairness, due 
process, privacy, and disclosure issues will continue 
to fuel public passions around testing. 

As reviewed in chapter 2 of the full report, 
attempts to develop ethical and technical standards 
for tests and testing practices have a long history. 
The most recent attempt to codify standards for fair
testing practice (in the Code of Fair Testing Practices
in Education) 46 led to a set of principles with
which most professional testing groups concur. 

Educational testing practices in some areas have 
been defined by Federal legislation. In the mid- 
1970s, Congress passed laws with significant provi- 
sions regarding testing, one affecting all students 
and parents and the others affecting individuals with 
disabilities and their parents. In both cases this 
Federal legislation has had far-reaching implications 
for school policy, because Federal financial assist- 
ance to schools has been tied to mandated testing 
practices. The Family Educational Rights and Privacy
Act of 1974 (commonly called the "Buckley Amendment"
after former New York Senator James
Buckley) was enacted in part to attempt to safeguard
parents' rights and to correct some of the
improprieties in the collection and maintenance of 
pupil records. The basic provisions of this legisla- 
tion established the right of parents to inspect school 
records and protected the confidentiality of informa- 
tion by limiting access to school records (including 
test scores) to those who have legitimate educational 
needs for the information and by requiring parental 
written consent for the release of identifiable data. 

Given the growing importance of testing and 
the precedent for Federal action, several avenues 
are open if Congress wishes to foster better 
educational testing practices and appropriate test 
use throughout the Nation. 

One option for congressional action would aim at 
improved disclosure of information. Individual
rights could be better safeguarded by encouraging 
test users (policymakers and schools) to do a careful 
job of informing test takers. Many critical decisions 
about test use, such as the selection and interpreta- 
tion of tests, are made in a professional arena that is 
well-protected from open, public scrutiny. This 
occurs in part because of the highly technical nature 
of testing design. Although the professional testing 
community is not unanimous about what constitutes 
good testing practice, there is considerable consensus
on the importance of carefully informing individual
test takers (and their parents or guardians in
the case of minors) about the purpose of the test, the
uses to which it will be put, the persons who will
have access to the scores, and the rights of the test
taker to retake or challenge test results. 47

45 See, e.g., Robert E. Slavin and Nancy A. Madden, Center for Research on Effective Schooling for Disadvantaged Students, The Johns Hopkins
University, "Chapter 1 Program Improvement Guidelines: Do They Reward Appropriate Practices?" paper prepared for the Office of Educational
Research and Improvement, U.S. Department of Education, December 1990. See also Nancy Kober, "The Role and Impact of Chapter 1, ESEA,
Evaluation and Assessment Practices," OTA contractor report, June 1991.

46 Joint Committee on Testing Practices, Code of Fair Testing Practices in Education (Washington, DC: National Council on Measurement in
Education, 1988).

Congress could require, or encourage, school 
districts to: 

• develop and publish a testing policy that spells 
out the types of tests given, how they are 
chosen, and how the tests and test scores will be 
used; and 

• notify parents of test requirements and conse- 
quences, with special emphasis on tests used 
for selection, placement, or credentialing deci- 
sions. 

A second approach for Congress is to encourage 
good testing practice by modeling and demonstrat- 
ing such practice at the Federal level. The Federal 
Government writes much legislation that incorporates
standardized testing as one component of a
larger program. For example, the Individuals With 
Disabilities Education Act (Public Law 101-476), 
formerly the Education for All Handicapped Children
Act of 1975 (Public Law 94-142), was designed
to assure the rights of individuals with disabilities to 
the best possible education; this legislation included 
a number of explicit provisions regarding how tests 
should be used to implement this program. 

Among the provisions were: 1) decisions about 
students are to be based on more than performance 
on a single test, 2) tests must be validated for the 
purpose for which they are used, 3) children must be 
assessed in all areas related to a specific or suspected 
disability, and 4) evaluations should be made by a 
multidisciplinary team. 

Through these assessment provisions, Public 
Laws 101-476 and 94-142 have provided a number 
of significant safeguards against the simplistic or 
capricious use of test scores in making educational 
decisions. Congress could adopt similar provisions 
in other legislation that has implications for testing. 
A recent example of Federal legislation that could 
lead to questionable uses of tests is a provision in the 
1990 Omnibus Budget Reconciliation Act. The
objective of this provision is to reduce the high loan
default rate of students attending postsecondary 
training programs (largely but not exclusively in 
proprietary technical schools). The policy lever is 
testing: the act requires students without a high 
school diploma to pass an "ability-to-benefit" test, 
on the assumption that students who are able to 
benefit from postsecondary training will be more 
likely to get jobs and pay back their loans than 
students who are not able to benefit. Basic questions 
arise about the appropriateness of using existing 
tests to sort individuals on this broad "ability" 
criterion. Even the most prevalent college admis- 
sions tests do not make claims of being able to 
predict which students will "benefit" in the long 
run, but rather which students will do well in their 
freshman year. 

A third course of action would focus on various 
proposals to certify, regulate, oversee, or audit tests. 
If Congress wants to play a more forceful role in 
preventing misuse of tests—in particular, preventing 
tests designed for classroom use or system monitor- 
ing from being applied to individual selection or 
certification decisions— this option is the clear 
choice. If testing continues to increase and takes on 
even more consequences, pressure for congressional 
intervention will grow. Proposals include Federal 
guidelines for educational test use, labeling of all 
mandated tests and test requirements, labeling of all 
commercially available tests, and creating a govern- 
mental or quasi-governmental entity to regulate, 
certify, and disseminate information about tests. 
This last option, which echoes a concept endorsed by 
the National Commission on Testing and Public 
Policy, has been discussed in testing policy circles 
for some years now. 48 

Finally, Congress could pursue more indirect 
ways to inform and educate consumers and users of 
tests. This might include supporting continuing 
professional education for teachers and administra- 
tors, or funding the development of better ways to 
analyze test data and convey the results more 
effectively to the public. 



47 See, for example, American Psychological Association, Standards for Educational and Psychological Testing (Washington, DC: 1985); Joint
Committee on Testing Practices, op. cit., footnote 46; and Russell Sage Foundation, Guidelines for the Collection, Maintenance, and Dissemination of
Pupil Records (New York, NY: 1969), especially Guideline 1.3.

48 See, e.g., D. Goslin, "The Present and Future of Assessment: Towards an Agenda for Research and Public Policy," draft report of a planning meeting
sponsored by the U.S. Department of Education, Mar. 23-25, 1990, draft dated July 19, 1990.

Federal Research and Development Options

Test development is a costly process. Even for a 
test or test battery that has already been in use for 
many years, it can take from 6 to 8 years to write new
items, pilot test, and validate a major revision. 49 
Most investigators working on new testing designs 
are wading into uncharted statistical and methodo- 
logical waters. For a new test, consisting of open- 
ended performance tasks or other innovative items, 
development and validation are substantially more 
expensive, even if test content and objectives are 
clearly defined. For example, the development of a 
set of new performance measures assessing specific 
job-related skills for the armed services cost $30 
million over 10 years. The results of this sustained 
research effort, coordinated by the Department of 
Defense and carried out by the individual service 
organizations, were a set of hands-on measures, new 
supervisory ratings, job-knowledge tests, and com- 
puter-based simulations representing the skills re- 
quired in some 30 well-defined jobs. The main 
purpose of the research was to improve the outcome 
or criterion measures used to validate the Armed 
Services Vocational Aptitude Battery, the standardized
test used to qualify new recruits for various job
assignments. 50

In elementary and secondary school testing, 
however, the first step, defining the content that
tests should cover, is much more complex than
defining specific job performance outcomes for a 
number of jobs. The omnipresent issue of achieving 
consensus on content poses formidable barriers to 
test design. Even in a subject like mathematics, for 
which there is some agreement on outcomes and 
standards (as exemplified by the National Council
of Teachers of Mathematics' recent work on standards
for mathematics education), the definition of
those standards took 6 years to develop. In most
other subjects consensus on goals and curricula is
more difficult to reach, adding substantially to 
research and development (R&D) costs. Moreover, 
separate standards, content, and tests would need to 
be developed for each grade level and subject to be 
tested. 



Another factor making testing R&D expensive is 
the question of how new assessment methods will 
affect students and teachers. Much of the interest in 
developing new assessments (see ch. 6 in the full 
report) stems from the desire to see those assess- 
ments eventually become the basis for system 
monitoring and other high-stakes decisions. Validation
studies are therefore critical. Random assignment
experiments, which are costly, could encounter
legal barriers because students' lives and educational
experiences could be affected. Validation
studies, therefore, may need to be conducted with
quasi-experimental designs, which suffer from various
statistical and methodological problems. 51

Congress has an important role to play in 
supporting R&D in educational testing, because 
adequate funding cannot be expected from other 
sources. Commercial vendors are not likely to make 
the requisite investments without some assurance of 
a reasonable return; they face strong market incen- 
tives to sell generic products that match the curricula 
of many school systems. But if these products are so 
general in their coverage that they reflect only a 
limited subset of skills common to virtually all 
curricula, schools may not see the advantage of 
adding them to an already strapped instructional 
materials budget. States might be willing to foot the 
R&D bill, although their education budgets are 
generally quite constrained. Moreover, in addition to 
costs associated with consensus-building on test 
content and evaluation of the anticipated effects of 
testing, new performance assessment and/or com- 
puter-based methods require basic research on 
learning and cognition. Basic education research has 
traditionally been a Federal responsibility. 

The question becomes how much: how much 
should the Federal Government spend on educa- 
tional testing R&D? The answer depends on the 
choice Congress makes regarding the value of 
dramatically enlarging the currently available range 
of testing methods. For example, Federal spending 
on educational assessment research is roughly $7 
million for fiscal year 1992, out of a total education 
research budget of close to $100 million. 52 This
money is divided almost evenly among NAEP (for
validation studies, evaluation of trial State assessment,
and secondary data analysis); development of
new mathematics and science assessments ($6
million over 3 years, administered through the
National Science Foundation); and general assessment
research (through the Center for Research on
Evaluation, Standards, and Student Testing).

49 Rudman, op. cit., footnote 8, p. 8.

50 See Alexandra Wigdor and Bert Green (eds.), Performance Assessment for the Workplace, vol. 1 (Washington, DC: National Academy Press, 1991).

51 See, e.g., Anand Desai, "Technical Issues in Measuring Scholastic Improvement Due to Compensatory Education Programs," Socio-Economic
Planning Sciences, vol. 24, No. 2, 1990, pp. 143-153.

52 Education research and statistics spending in fiscal year 1990 was $94 million. See U.S. Department of Education, Digest of Education Statistics,
1990, op. cit., footnote [illegible], p. 344.

Substantially more funding would be needed if 
Congress chooses to support: 

• cognitive science research on learning and 
testing, 

• development of new approaches to consensus 
building for test content and objectives, 

• research on the generalizability of new testing 
methods across subjects and grades, and 

• validation studies of new testing methods. 



An intermediate funding approach would be to 
target Federal dollars toward: 

• the creation of a clearinghouse to facilitate 
continuing and more widespread dissemination 
of testing research results and innovations, 

• continuing professional education for teachers 
in the applications of new testing and assess- 
ment methods and in the appropriate interpreta- 
tions and uses of test results, and 

• the creation of a nationwide computer-based 
clearinghouse of test items from which States 
and local districts could draw to develop their 
own customized tests.

Appendix A

Contractor Reports 



Copies of contractor reports done for this project are available through the National Technical Information Service
(NTIS), either by mail (U.S. Department of Commerce, National Technical Information Service, Springfield, VA 22161)
or by calling NTIS directly at (703) 487-4650.

Douglas A. Archbald, University of Delaware, and Arnold C. Porter, University of Wisconsin, Madison, "A 
Retrospective and an Analysis of Roles of Mandated Testing in Education Reform," PB 92-127596.

C.V. Bunderson, J.B. Olson, and A. Greenberg, The Institute for Computer Uses in Education, "Computers in
Educational Assessment: An Opportunity to Restructure Educational Practice," PB 92-127604.

Paul Burke, "You Can Lead Adolescents to a Test But You Can't Make Them Try," PB 92-127638.

Center for Children and Technology, Bank Street College, "Applications in Educational Assessment: Future 
Technologies," PB 92-127588. 

Nancy Kober, "The Role and Impact of Chapter 1, ESEA, Evaluation and Assessment Practices," PB 
92-127646. 

George F. Madaus, Boston College, and Thomas Kellaghan, St. Patricks College, Dublin, "Examination 
Systems in the European Community: Implications for a National Examination System in the United States," 
PB 92-127570. 

Gail R. Meister, Research for Better Schools, "Assessment in Programs for Disadvantaged Students: Lessons 
From Accelerated Schools," PB 92-127612. 

Ruth Mitchell and Amy Stempel, Council for Basic Education, "Six Case Studies of Performance 
Assessment," PB 92-127620. 

Misuse of Tests, PB 92-127653 

1. Larry Cuban, Stanford University, "The Misuse of Tests in Education." 

2. Robert L. Linn, University of Colorado at Boulder, "Test Misuse: Why Is It So Prevalent?" 

3. Nelson L. Noggle, Centers for the Advancement of Educational Practices, "The Misuse of Educational 
Achievement Tests for Grades K-12: A Perspective." 

A copy of the contractor report listed below may be obtained by writing to the SET Program, Office of Technology 
Assessment, U.S. Congress, Washington, DC 20510-8025; or by calling (202) 228-6920. 

George F. Madaus, Boston College, and Thomas Kellaghan, St. Patricks College, Dublin, "Student 
Examination Systems in the European Community: Lessons for the United States."

Office of Technology Assessment



The Office of Technology Assessment (OTA) was created in 1972 as an 
analytical arm of Congress. OTA's basic function is to help legislative policy- 
makers anticipate and plan for the consequences of technological changes and 
to examine the many ways, expected and unexpected, in which technology 
affects people's lives. The assessment of technology calls for exploration of 
the physical, biological, economic, social, and political impacts that can result 
from applications of scientific knowledge. OTA provides Congress with in- 
dependent and timely information about the potential effects— both benefi- 
cial and harmful— of technological applications. 

Requests for studies are made by chairmen of standing committees of the 
House of Representatives or Senate; by the Technology Assessment Board, 
the governing body of OTA; or by the Director of OTA in consultation with 
the Board. 

The Technology Assessment Board is composed of six members of the 
House, six members of the Senate, and the OTA Director, who is a non- 
voting member. 

OTA has studies under way in nine program areas: energy and materi- 
als; industry, technology, and employment; international security and com- 
merce; biological applications; food and renewable resources; health; 
telecommunication and computing technologies; oceans and environment; 
and science, education, and transportation.